MXB341 Statistical Inference
“It is a capital mistake to theorise before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
— Sherlock Holmes
Week 1: Estimating Parameters
Statistics, or particularly statistical inference, is how we connect our observations of the real world to the probabilistic models we construct for observable phenomena. The term inference encompasses a variety of activities. Last week, we began with the most basic part of inference: estimating the values of our probabilistic models’ parameters. We learned three basic approaches: least squares, the method of moments, and maximum likelihood. Each of these has its strengths and weaknesses, and each can be useful in different situations.
Methods for Estimating Parameters
Least Squares
Method: Minimise the squared point-wise distance between the expected value and the observed data by defining the objective function \[ \begin{align} Q(\theta,\mathbf{y})&= \sum_{i=1}^{n}\left( y_{i} - \operatorname{E}(Y)\right)^2\\ &=\sum_{i=1}^{n}\left( y_{i} - g(\theta)\right)^2 \end{align} \] where \(\operatorname{E}(Y)=g(\theta)\). Minimising this function with respect to \(\theta\), \[ \hat{\theta}_{LS} = \text{arg min}_{\theta} \sum_{i=1}^{n}\left( y_{i} - \operatorname{E}(Y)\right)^2, \] yields the least squares estimator of \(\theta\), \(\hat{\theta}_{LS}\).
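To make this concrete, here is a minimal R sketch (not part of the original notes) for the simplest case where \(\operatorname{E}(Y)=\theta\), so the least squares estimate of \(\theta\) minimises the sum of squared deviations from the data; the data vector and the use of optimize are illustrative assumptions only.
y <- c(2.1, 3.4, 2.8, 3.9, 3.1)              # made-up data
Q <- function(theta, y) sum((y - theta)^2)   # objective Q(theta, y)
# minimise Q over a plausible interval; for this model the minimiser is the sample mean
optimize(Q, interval = c(0, 10), y = y)$minimum
mean(y)                                      # closed-form answer, for comparison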
Method of Moments
Method: Substitute the sample moments based on the data for the theoretical moments (as functions of the parameters) and solve for the parameters.
The theoretical moments for the density function \(f(x;\theta)\) are \[ \begin{aligned} \mu_1(\theta) & = \int_{-\infty}^{\infty}xf(x;\theta)dx\\ \mu_2(\theta) & = \int_{-\infty}^{\infty}x^2f(x;\theta)dx\\ \vdots & \\ \mu_k(\theta) & = \int_{-\infty}^{\infty}x^kf(x;\theta)dx. \end{aligned} \] The sample moments based on the data are \[ \begin{aligned} m_1 & = \frac{1}{n}\sum_{i = 1}^nx_i\\ m_2 & = \frac{1}{n}\sum_{i = 1}^nx_i^2\\ \vdots & \\ m_k & = \frac{1}{n}\sum_{i = 1}^nx_i^k. \end{aligned} \] In the univariate case, write \(\mu_k=g(\theta)\) and equate it to the corresponding sample moment \(m_k\); the method of moments estimator for \(\theta\) is then \[ \hat{\theta}_{MM}=g^{-1}(m_k). \]
Given a vector of parameters \(\boldsymbol{\theta}=(\theta_1,\theta_2,\ldots,\theta_p)\), the corresponding moments \[ \begin{align} \mu_1& = g_1(\boldsymbol{\theta})\\ \vdots & \\ \mu_p & = g_p(\boldsymbol{\theta}) \end{align} \] yield a system of \(p\) equations which is solved to find the method of moments estimators.
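As a hedged illustration (not from the notes), suppose the data come from an exponential distribution with rate \(\lambda\), so the first theoretical moment is \(\mu_1 = g(\lambda) = 1/\lambda\). Equating \(\mu_1\) to \(m_1\) and inverting gives \(\hat{\lambda}_{MM} = 1/m_1\); the simulated data below are purely illustrative.
set.seed(1)
x <- rexp(100, rate = 2)   # simulated data with true rate lambda = 2
m1 <- mean(x)              # first sample moment
lambda_mm <- 1 / m1        # invert g(lambda) = 1/lambda
lambda_mm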
Maximum Likelihood
Method: Choose the parameter value which corresponds to the “most likely” distribution given the data. For a given distribution with probability mass/density function \(f(x;\theta )\), from which we assume the data \(x_{i}\) are generated iid, define the likelihood as \(L(\theta) = \prod_{i=1}^{n} f(x_{i};\theta )\). The maximum likelihood estimate is \[ \hat{\theta}_{ML} = \underset{\theta\in\Theta}{\operatorname{argmax}} \, L(\theta). \] Generally, we maximise the log-likelihood \[ \begin{align} l(\theta)&=\log\left( L(\theta)\right)\\ &=\sum_{i=1}^n\log\left(f(x_i;\theta)\right) \end{align} \] as it is more convenient to work with and yields the same estimate. The maximum likelihood estimator (MLE) has several nice properties which we will explore.
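As a small numerical sketch (again using the illustrative exponential model, which is an assumption rather than an example from the notes), the log-likelihood can be maximised directly in R; the closed-form MLE \(1/\bar{x}\) is shown for comparison.
set.seed(1)
x <- rexp(100, rate = 2)   # simulated data with true rate lambda = 2
loglik <- function(lambda, x) sum(dexp(x, rate = lambda, log = TRUE))  # l(lambda)
# maximise the log-likelihood over a plausible interval
optimize(loglik, interval = c(0.01, 10), x = x, maximum = TRUE)$maximum
1 / mean(x)                # closed-form MLE, for comparison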
Theory questions
Question 1
A random sample of observations, \(y_{1},y_{2},\ldots,y_{n}\), are taken from a population whose probability density function is \[ f(y;\theta) = \begin{cases} \theta y^{\theta-1}, & \text{if } 0 < y \leq 1 \\ 0, & \text{otherwise} \end{cases} \] where \(\theta > 0\).
- Find the mean of a random variable from a distribution with pdf \(f(y;\theta)\).
- Find the method of moments estimator for \(\theta\).
- What is the MM estimate of \(\theta\) if \(n=4\) and \((y_{1}, y_{2}, y_{3}, y_{4}) = (0.1, 0.2, 0.33, 0.4)\)?
Question 2
Consider a data set with \(y_{1}, y_{2}, \ldots, y_{n}\) from a Bernoulli distribution \[ p(y;p)=p^y(1-p)^{1-y}, \quad y \in \{0, 1\}. \]
Write the equation for the likelihood function \(L(p|y_{1})\) for the first observation only.
Define the vector \(\boldsymbol{y}\) as \(\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{n})\). Write the equation for the likelihood function for observing all \(y_{1}, y_{2}, \ldots, y_{n}\) as \(L(p|\mathbf{y})\).
Is the use of the condition (i.e. “\(\vert\)”) important in the likelihood and pmf? Is the ordering of \(\mathbf{y}\) and \(p\) important? Explain your answers.
Write the equation for the log-likelihood function \(\log(L) = l(p|\mathbf{y})\) and determine the MLE.
Rewrite the MLE \(\hat{p}_{ML}\) as a function \(\hat{p}_{ML}(s)\) of the statistic \(s = \sum_{i=1}^{n} y_{i}\). For what values of \(s\) is the function defined? What is the range of the function? For what values of \(s\) is the MLE (i.e. the maximum) unique?
Question 3
Say you have \(n\) pairs of data \(x_{i},y_{i}\) which appear to approximately follow a quadratic curve that goes through the origin, \(y = b x^2\). Find the least squares estimator of \(b\).
Question 4
A researcher is running an experiment using participants who are asked to complete a task within 10 minutes. They keep repeating the task until they fail \(k\) times. If \(1 - p\) is the unknown probability of success, then the number of successes until the experiment is completed is given by a negative binomial distribution. The pmf of the negative binomial in this situation is \[ p(y|k,p) = \frac{\Gamma(k+y)}{y!~\Gamma(k)}p^{k} (1-p)^y \] for \(y=0,1,2,\ldots\).
Let \(y_{1},y_{2},\ldots,y_{n}\) be the number of successes for \(n\) participants in the experiment. The ethics committee has accepted \(k = 5\) as a safe stopping criterion for the experiment.
- Determine the method of moments estimator for \(p\).
- Determine the maximum likelihood estimator for \(p\).
- Compare the two estimators and comment on their similarities.
Practical questions
For these questions, you will need to write functions, plot them, and optimise them. Here’s a quick summary of how to do that:
Question 1: Defining functions in R
Here is an example of defining a function in R. The function squared_plus_z takes two arguments: it squares the first and then adds the second. Note that the function is defined so that if only one argument is supplied, it assumes the second is \(0\).
squared_plus_z <- function(x, z = 0) # z is an argument with default value 0
{
result <- x^2 + z
return(result)
}
# applying to a vector works because the contents
# of the function are already vectorised
squared_plus_z( c(1, 11, 101), z = 1)
# you can even pipe if you want to (the %>% pipe requires the magrittr or dplyr package to be loaded)
c(1, 11, 101) %>% squared_plus_z(z = 1)
# log-likelihood for Bernoulli data: l(p | x) = sum_i [ x_i log(p) + (1 - x_i) log(1 - p) ]
l <- function(p, x)
{
  lp  <- log(p)                   # log(p)
  lp1 <- log(1 - p)               # log(1 - p)
  ans <- x * lp + (1 - x) * lp1   # contribution of each observation
  ans <- sum(ans)                 # sum over all observations
  return(ans)
}
Question 2: Plotting functions in ggplot
ggplot2 is a powerful graphics package in R that we can use to create various high-quality graphics. Here is an example of plotting the function “squared_plus_z()” we created previously.
library(ggplot2)   # for ggplot() and stat_function()
library(tibble)    # for tibble()
dummy_data <- tibble(x = c(-3, 3))   # x range to plot over
ggplot(data = dummy_data, aes(x = x) ) +
stat_function(fun = squared_plus_z, args = list(z = 1)) +
ylab(expression("f(x)="*x^2+1))
# example Bernoulli data for the log-likelihood function l(p, x) defined above
x <- c(0, 1, 1, 0, 0, 0, 0, 1, 0, 0)
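One possible way to visualise the log-likelihood (a sketch only, assuming the l(p, x) function from Question 1 and the data x above) is via stat_function; note that l() is not vectorised in p because it calls sum(), so it is wrapped with Vectorize() first.
l_vec <- Vectorize(l, vectorize.args = "p")   # vectorise over p for stat_function
ggplot(data = tibble(p = c(0.01, 0.99)), aes(x = p)) +
  stat_function(fun = l_vec, args = list(x = x)) +
  ylab("l(p | x)")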
Optimising functions with optim
The optim function is just one example of optimisation in R. Here we minimise \(x^2 - 1\) over \(x \in [-1,1]\). Several methods in optim require you to specify an interval (i.e. specify a bounded problem), but some don’t. method = "Brent" is only for one-dimensional optimisation.
# initial value for solver
init_val <- 1
opt_res <- optim(par = init_val,
fn = squared_plus_z,
z = -1,
method = "Brent",
lower = -1,
upper = 1
)
# optimum and value
round(c(opt_res$par, opt_res$value), 2)
#> [1] 0 -1
# optim minimises by default; to maximise instead, pass the argument
# control = list(fnscale = -1)
Now, find the MLE of \(p\) using the log-likelihood function you created in the previous exercises.
# simulate 10 Bernoulli observations with p = 0.4
x <- rbinom(10, 1, 0.4)
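One possible approach (a sketch, assuming the l(p, x) function from Question 1 and the simulated data above) is to hand l to optim and ask it to maximise via control = list(fnscale = -1).
opt_p <- optim(par = 0.5,                       # initial value for p
               fn = l,                          # Bernoulli log-likelihood from Question 1
               x = x,                           # data
               method = "Brent",                # one-dimensional, bounded
               lower = 0.001, upper = 0.999,
               control = list(fnscale = -1))    # fnscale = -1 turns optim into a maximiser
opt_p$par                                       # numerical MLE of p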
Question 3
Banjo is a St. Bernard. He is 80kg and not easily kept out of water. Suppose you would like to model how quickly St. Bernards grow from 0 to 12 months old and you have been given some data on age and weight of St. Bernards from a tutor who is very interested in large dogs. The data is contained in the R chunk below.
age_months <- c(0.4, 2.3, 2.9, 4.5, 4.9, 7.9, 10.1, 11.5)
weight_kgs <- c(1.8, 7.1, 17.0, 20.4, 36.1, 55.2, 58.4, 77.1)
Plot the data in R. Assuming their weight increases linearly, numerically estimate the model \[ \text{weight}_{i} = \alpha + \beta~\text{age}_{i} \] using the least squares method by following the steps below. Fix \(\alpha = 1\); that is, St. Bernards are assumed to be 1kg when born.
- Define the least squares function to be optimised. It should take the parameter(s) and data as arguments (see the sketch after this list).
- Optimise the least squares function using optim; try using method = "BFGS".
- Plot the least squares fit against the data. You will need to extract the fit from the result of optim; try using object_name$par.
- Compare your answer to the linear model function (lm) in R. Why are the values of \(\beta\) different? Based on this, what should you be cautious about when performing estimation on parameters?
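Here is a possible sketch of the first two steps (the function name ls_fn and the starting value are arbitrary choices, and \(\alpha\) is fixed at 1 as instructed); treat it as a starting point rather than the definitive solution.
ls_fn <- function(beta, age, weight, alpha = 1) {
  sum((weight - (alpha + beta * age))^2)        # sum of squared residuals
}
ls_fit <- optim(par = 1,                        # initial value for beta
                fn = ls_fn,
                age = age_months,
                weight = weight_kgs,
                method = "BFGS")
ls_fit$par                                      # least squares estimate of beta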
Supplementary Questions
Question 4
For the gamma distribution \[ f(x;\alpha,\beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \quad x > 0, \] given the data:
x <- c(2.3905022, 3.4975296, 2.6559558, 1.6964185, 1.4185724, 4.7956602, 3.0188619,
       1.9287381, 3.7493438, 1.4089986, 0.4889999, 1.7204471, 4.0426991, 2.4519901,
       1.8883987, 2.0412268, 1.9829191, 0.6888060, 3.4596279, 2.5985436)
Find the least squares estimators for \(\alpha\) and \(\beta\), where \[ \operatorname{E}(X)=\frac{\alpha}{\beta}. \] Is this possible? Why or why not?
Find the method of moments estimators for \(\alpha\) and \(\beta\) where \[ \operatorname{E}(X)=\frac{\alpha}{\beta} \] and \[ \operatorname{Var}(X)=\frac{\alpha}{\beta^2} \]
Find the maximum likelihood estimators for \(\alpha\) and \(\beta\), compare them to the method of moments estimators.
Plot the log-likelihood function. Create two plots: one where you set \(\beta=\hat{\beta}\) (the MLE of \(\beta\)) and plot the log-likelihood as a function of \(\alpha\), and one where you set \(\alpha=\hat{\alpha}\) and plot the log-likelihood as a function of \(\beta\).
(BONUS: See if you can create a 3-D plot of the log-likelihood for both parameters.)
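If you attempt the bonus, one hedged sketch (assuming the data x above and base R graphics; the function name gamma_loglik and the grid limits are arbitrary choices) is to evaluate the log-likelihood on a grid with outer() and draw the surface with persp().
gamma_loglik <- function(alpha, beta, x) sum(dgamma(x, shape = alpha, rate = beta, log = TRUE))
alpha_grid <- seq(1, 8, length.out = 50)
beta_grid  <- seq(0.5, 4, length.out = 50)
# evaluate the log-likelihood at every (alpha, beta) pair on the grid
ll_surface <- outer(alpha_grid, beta_grid,
                    Vectorize(function(a, b) gamma_loglik(a, b, x = x)))
persp(alpha_grid, beta_grid, ll_surface,
      xlab = "alpha", ylab = "beta", zlab = "log-likelihood",
      theta = 30, phi = 20)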