MXB341 Statistical Inference
“It is a capital mistake to theorise before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
— Sherlock Holmes
Week 1: Estimating Parameters
Statistics, or particularly statistical inference, is how we connect our observations of the real world to the probabilistic models we construct for observable phenomena. The term inference encompasses a variety of activities. Last week, we began with the most basic part of inference: estimating the values of our probabilistic models’ parameters. We learned three basic approaches: least squares, the method of moments, and maximum likelihood. Each of these has its strengths and weaknesses, and each can be useful in different situations.
Methods for Estimating Parameters
Least Squares
Method: Minimise the squared point-wise distance between the expected value and the observed data by defining the objective function \[ \begin{align} Q(\theta,\mathbf{y})&= \sum_{i=1}^{n}\left( y_{i} - \operatorname{E}(Y)\right)^2\\ &=\sum_{i=1}^{n}\left( y_{i} - g(\theta)\right)^2 \end{align} \] where \(\operatorname{E}(Y)=g(\theta)\). Minimising this function with respect to \(\theta\), \[ \hat{\theta}_{LS} = \text{arg min}_{\theta} \sum_{i=1}^{n}\left( y_{i} - \operatorname{E}(Y)\right)^2, \] yields the least squares estimator of \(\theta\), \(\hat{\theta}_{LS}\).
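To make this concrete, here is a minimal R sketch (not part of the original notes) for the simplest case where \(\operatorname{E}(Y)=\theta\), so the least squares estimate of \(\theta\) minimises the sum of squared deviations from the data; the data vector and the use of optimize are illustrative assumptions only.
y <- c(2.1, 3.4, 2.8, 3.9, 3.1)              # made-up data
Q <- function(theta, y) sum((y - theta)^2)   # objective Q(theta, y)
# minimise Q over a plausible interval; for this model the minimiser is the sample mean
optimize(Q, interval = c(0, 10), y = y)$minimum
mean(y)                                      # closed-form answer, for comparison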
Method of Moments
Method: Substitute the sample moments based on the data for the theoretical moments (as functions of the parameters) and solve for the parameters.
The theoretical moments for the density function \(f(x;\theta)\) are \[ \begin{aligned} \mu_1(\theta) & = \int_{-\infty}^{\infty}xf(x;\theta)dx\\ \mu_2(\theta) & = \int_{-\infty}^{\infty}x^2f(x;\theta)dx\\ \vdots & \\ \mu_k(\theta) & = \int_{-\infty}^{\infty}x^kf(x;\theta)dx. \end{aligned} \] The sample moments based on the data are \[ \begin{aligned} m_1 & = \frac{1}{n}\sum_{i = 1}^nx_i\\ m_2 & = \frac{1}{n}\sum_{i = 1}^nx_i^2\\ \vdots & \\ m_k & = \frac{1}{n}\sum_{i = 1}^nx_i^k. \end{aligned} \] In the univariate case, write \(\mu_k=g(\theta)\) and equate it to the corresponding sample moment \(m_k\); the method of moments estimator for \(\theta\) is then \[ \hat{\theta}_{MM}=g^{-1}(m_k). \]
Given a vector of parameters \(\boldsymbol{\theta}=(\theta_1,\theta_2,\ldots,\theta_p)\), the corresponding moments \[ \begin{align} \mu_1& = g_1(\boldsymbol{\theta})\\ \vdots & \\ \mu_p & = g_p(\boldsymbol{\theta}) \end{align} \] yield a system of \(p\) equations which is solved to find the method of moments estimators.
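As a hedged illustration (not from the notes), suppose the data come from an exponential distribution with rate \(\lambda\), so the first theoretical moment is \(\mu_1 = g(\lambda) = 1/\lambda\). Equating \(\mu_1\) to \(m_1\) and inverting gives \(\hat{\lambda}_{MM} = 1/m_1\); the simulated data below are purely illustrative.
set.seed(1)
x <- rexp(100, rate = 2)   # simulated data with true rate lambda = 2
m1 <- mean(x)              # first sample moment
lambda_mm <- 1 / m1        # invert g(lambda) = 1/lambda
lambda_mm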
Maximum Likelihood
Method: Choose the parameter value which corresponds to the “most likely” distribution given the data. For a given distribution with probability mass/density function \(f(x;\theta )\), from which we assume the data \(x_{i}\) are generated iid, define the likelihood as \(L(\theta) = \prod_{i=1}^{n} f(x_{i};\theta )\). The maximum likelihood estimate is \[ \hat{\theta}_{ML} = \underset{\theta\in\Theta}{\operatorname{argmax}} \, L(\theta). \] Generally, we maximise the log-likelihood \[ \begin{align} l(\theta)&=\log\left( L(\theta)\right)\\ &=\sum_{i=1}^n\log\left(f(x_i;\theta)\right) \end{align} \] as it is more convenient to work with and yields the same estimate. The maximum likelihood estimator (MLE) has several nice properties which we will explore.
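As a small numerical sketch (again using the illustrative exponential model, which is an assumption rather than an example from the notes), the log-likelihood can be maximised directly in R; the closed-form MLE \(1/\bar{x}\) is shown for comparison.
set.seed(1)
x <- rexp(100, rate = 2)   # simulated data with true rate lambda = 2
loglik <- function(lambda, x) sum(dexp(x, rate = lambda, log = TRUE))  # l(lambda)
# maximise the log-likelihood over a plausible interval
optimize(loglik, interval = c(0.01, 10), x = x, maximum = TRUE)$maximum
1 / mean(x)                # closed-form MLE, for comparison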
Theory questions
Question 1
A random sample of observations, \(y_{1},y_{2},\ldots,y_{n}\), are taken from a population whose probability density function is \[ f(y;\theta) = \begin{cases} \theta y^{\theta-1}, & \text{if } 0 < y \leq 1 \\ 0, & \text{otherwise} \end{cases} \] where \(\theta > 0\).
- Find the mean of a random variable from a distribution with pdf \(f(y;\theta)\).
- Find the method of moments estimator for \(\theta\).
- What is the MM estimate of \(\theta\) if \(n=4\) and \((y_{1}, y_{2}, y_{3}, y_{4}) = (0.1, 0.2, 0.33, 0.4)\)?
Question 2
Consider a data set with \(y_{1}, y_{2}, \ldots, y_{n}\) from a Bernoulli distribution \[ p(y;p)=p^y(1-p)^{1-y}, \quad y \in \{0, 1\}. \]
Write the equation for the likelihood function \(L(p|y_{1})\) for the first observation only.
Define the vector \(\boldsymbol{y}\) as \(\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{n})\). Write the equation for the likelihood function for observing all \(y_{1}, y_{2}, \ldots, y_{n}\) as \(L(p|\mathbf{y})\).
Is the use of the condition (i.e. “\(\vert\)”) important in the likelihood and pmf? Is the ordering of \(\mathbf{y}\) and \(p\) important? Explain your answers.
Write the equation for the log-likelihood function \(\log(L) = l(p|\mathbf{y})\) and determine the MLE.
Rewrite the MLE \(\hat{p}_{ML}\) as a function \(\hat{p}_{ML}(s)\) of the statistic \(s = \sum_{i=1}^{n} y_{i}\). For what values of \(s\) is the function defined? What is the range of the function? For what values of \(s\) is the MLE (i.e. the maximum) unique?
Question 3
Say you have \(n\) pairs of data \(x_{i},y_{i}\) which appear to approximately follow a quadratic curve that goes through the origin, \(y = b x^2\). Find the least squares estimator of \(b\).
Question 4
A researcher is running an experiment using participants who are asked to complete a task within 10 minutes. They keep repeating the task until they fail \(k\) times. If \(1 - p\) is the unknown probability of success, then the number of successes until the experiment is completed is given by a negative binomial distribution. The pmf of the negative binomial in this situation is \[ p(y|k,p) = \frac{\Gamma(k+y)}{y!~\Gamma(k)}p^{k} (1-p)^y \] for \(y=0,1,2,\ldots\).
Let \(y_{1},y_{2},\ldots,y_{n}\) be the number of successes for \(n\) participants in the experiment. The ethics committee has accepted \(k = 5\) as a safe stopping criterion for the experiment.
- Determine the method of moments estimator for \(p\).
- Determine the maximum likelihood estimator for \(p\).
- Compare the two estimators and comment on their similarities.
Practical questions
For these questions, you will need to write functions, plot them, and optimise them. Here’s a quick summary of how to do that:
Question 1: Defining functions in R
Here is an example of defining a function in R. The function squared_plus_z takes two arguments: it squares the first and then adds the second. Note that the function is defined so that if only one argument is supplied, it assumes the second is \(0\).
squared_plus_z <- function(x, z = 0) # z is an argument with default value 0
{
result <- x^2 + z
return(result)
}
# applying to a vector works because the contents
# of the function are already vectorised
squared_plus_z( c(1, 11, 101), z = 1)
# you can even pipe if you want to (the %>% pipe requires the magrittr or dplyr package to be loaded)
c(1, 11, 101) %>% squared_plus_z(z = 1)
# log-likelihood for Bernoulli data: l(p | x) = sum_i [ x_i log(p) + (1 - x_i) log(1 - p) ]
l <- function(p, x)
{
  lp  <- log(p)                   # log(p)
  lp1 <- log(1 - p)               # log(1 - p)
  ans <- x * lp + (1 - x) * lp1   # contribution of each observation
  ans <- sum(ans)                 # sum over all observations
  return(ans)
}
Question 2: Plotting functions in ggplot
ggplot2 is a powerful graphics package in R that we can use to create various high-quality graphics. Here is an example of plotting the function “squared_plus_z()” we created previously.
library(ggplot2)   # for ggplot() and stat_function()
library(tibble)    # for tibble()
dummy_data <- tibble(x = c(-3, 3))   # x range to plot over
ggplot(data = dummy_data, aes(x = x) ) +
stat_function(fun = squared_plus_z, args = list(z = 1)) +
ylab(expression("f(x)="*x^2+1))
# example Bernoulli data for the log-likelihood function l(p, x) defined above
x <- c(0, 1, 1, 0, 0, 0, 0, 1, 0, 0)
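One possible way to visualise the log-likelihood (a sketch only, assuming the l(p, x) function from Question 1 and the data x above) is via stat_function; note that l() is not vectorised in p because it calls sum(), so it is wrapped with Vectorize() first.
l_vec <- Vectorize(l, vectorize.args = "p")   # vectorise over p for stat_function
ggplot(data = tibble(p = c(0.01, 0.99)), aes(x = p)) +
  stat_function(fun = l_vec, args = list(x = x)) +
  ylab("l(p | x)")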
Optimising functions with optim
The optim function is just one example of optimisation in R. Here we minimise \(x^2 - 1\) over \(x \in [-1,1]\). Several methods in optim require you to specify an interval (i.e. specify a bounded problem), but some don’t. method = "Brent" is only for one-dimensional optimisation.
# initial value for solver
init_val <- 1
opt_res <- optim(par = init_val,
fn = squared_plus_z,
z = -1,
method = "Brent",
lower = -1,
upper = 1
)
# optimum and value
round(c(opt_res$par, opt_res$value), 2)
#> [1] 0 -1
# optim minimises by default; to maximise instead, pass the argument
# control = list(fnscale = -1)
Now, find the MLE of \(p\) using the log-likelihood function you created in the previous exercises.
# simulate 10 Bernoulli observations with p = 0.4
x <- rbinom(10, 1, 0.4)
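One possible approach (a sketch, assuming the l(p, x) function from Question 1 and the simulated data above) is to hand l to optim and ask it to maximise via control = list(fnscale = -1).
opt_p <- optim(par = 0.5,                       # initial value for p
               fn = l,                          # Bernoulli log-likelihood from Question 1
               x = x,                           # data
               method = "Brent",                # one-dimensional, bounded
               lower = 0.001, upper = 0.999,
               control = list(fnscale = -1))    # fnscale = -1 turns optim into a maximiser
opt_p$par                                       # numerical MLE of p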
Question 3
Banjo is a St. Bernard. He is 80kg and not easily kept out of water. Suppose you would like to model how quickly St. Bernards grow from 0 to 12 months old and you have been given some data on age and weight of St. Bernards from a tutor who is very interested in large dogs. The data is contained in the R chunk below.
age_months <- c(0.4, 2.3, 2.9, 4.5, 4.9, 7.9, 10.1, 11.5)
weight_kgs <- c(1.8, 7.1, 17.0, 20.4, 36.1, 55.2, 58.4, 77.1)
Plot the data in R. Assuming their weight increases linearly, numerically estimate the model \[ \text{weight}_{i} = \alpha + \beta~\text{age}_{i} \] using the least squares method by following the steps below. Fix \(\alpha = 1\); that is, St. Bernards are assumed to be 1kg when born.
- Define the least squares function to be optimised. It should take the parameter(s) and data as arguments (see the sketch after this list).
- Optimise the least squares function using optim; try using method = "BFGS".
- Plot the least squares fit against the data. You will need to extract the fit from the result of optim; try using object_name$par.
- Compare your answer to the linear model function (lm) in R. Why are the values of \(\beta\) different? Based on this, what should you be cautious about when performing estimation on parameters?
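Here is a possible sketch of the first two steps (the function name ls_fn and the starting value are arbitrary choices, and \(\alpha\) is fixed at 1 as instructed); treat it as a starting point rather than the definitive solution.
ls_fn <- function(beta, age, weight, alpha = 1) {
  sum((weight - (alpha + beta * age))^2)        # sum of squared residuals
}
ls_fit <- optim(par = 1,                        # initial value for beta
                fn = ls_fn,
                age = age_months,
                weight = weight_kgs,
                method = "BFGS")
ls_fit$par                                      # least squares estimate of beta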
Supplementary Questions
Question 4
For the gamma distribution \[ f(x;\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \quad x > 0, \] given the data:
x <- c(2.3905022, 3.4975296, 2.6559558, 1.6964185, 1.4185724, 4.7956602, 3.0188619,
       1.9287381, 3.7493438, 1.4089986, 0.4889999, 1.7204471, 4.0426991, 2.4519901,
       1.8883987, 2.0412268, 1.9829191, 0.6888060, 3.4596279, 2.5985436)
Find the least squares estimators for \(\alpha\) and \(\beta\), where \[ \operatorname{E}(X)=\frac{\alpha}{\beta}. \] Is this possible? Why or why not?
Find the method of moments estimators for \(\alpha\) and \(\beta\) where \[ \operatorname{E}(X)=\frac{\alpha}{\beta} \] and \[ \operatorname{Var}(X)=\frac{\alpha}{\beta^2} \]
Find the maximum likelihood estimators for \(\alpha\) and \(\beta\), compare them to the method of moments estimators.
Plot the log-likelihood function. Create two plots: one where you set \(\beta=\hat{\beta}\) (the MLE of \(\beta\)) and plot the log-likelihood as a function of \(\alpha\), and one where you set \(\alpha=\hat{\alpha}\) and plot the log-likelihood as a function of \(\beta\).
(BONUS: See if you can create a 3-D plot of the log-likelihood for both parameters.)
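If you attempt the bonus, one hedged sketch (assuming the data x above and base R graphics; the function name gamma_loglik and the grid limits are arbitrary choices) is to evaluate the log-likelihood on a grid with outer() and draw the surface with persp().
gamma_loglik <- function(alpha, beta, x) sum(dgamma(x, shape = alpha, rate = beta, log = TRUE))
alpha_grid <- seq(1, 8, length.out = 50)
beta_grid  <- seq(0.5, 4, length.out = 50)
# evaluate the log-likelihood at every (alpha, beta) pair on the grid
ll_surface <- outer(alpha_grid, beta_grid,
                    Vectorize(function(a, b) gamma_loglik(a, b, x = x)))
persp(alpha_grid, beta_grid, ll_surface,
      xlab = "alpha", ylab = "beta", zlab = "log-likelihood",
      theta = 30, phi = 20)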