MXB341 Worksheet 05 - Asymptotics and confidence intervals

Finding Confidence Intervals

How do we find intervals for estimators that effectively communicate our confidence in the estimation process? We can outline three methods: exploiting the asymptotic behaviour of maximum likelihood estimators, inverting test statistics, and using pivotal quantities. Each of these approaches has its benefits, but we need to consider their limitations against the theoretical minimum width confidence interval.

Asymptotic normality of MLEs

When working with maximum likelihood estimators (MLE), we can show that under certain regularity conditions, the distribution of the MLE of a parameter \(\hat{\theta}\) converges to a Gaussian distribution as the sample size \(n\) approaches infinity \[ \sqrt{\mathcal{I}_n(\theta)}(\hat{\theta}-\theta)\sim N(0,1)\text{ as }n\rightarrow\infty. \]

Practically speaking, if \(n\) is sufficiently large, we can assume for making an inference that the MLE’s follows a Gaussian distribution. Thus, we can use this fact to make statements of probability associated with the MLE.

Note that we parameterise the sampling distribution for \(\hat{\theta}\) by \(\theta\), the parameter of interest. This parameterisation can be a source of confusion in interpreting the statement of probability about \(\hat{\theta}\) with regards to \(\theta\).

Remember that the parameter \(\theta\) is not a random variable; it has a fixed (but unknown) value. Any statements of probability made using the sampling distribution of the MLE apply only to the MLE.

Consider the Central Limit Theorem, which states that the distribution of the MLE will converge to a Gaussian distribution as the sample size \(n\) approaches infinity.
\[ \sqrt{n}(\bar{x}-\mu)\rightarrow N\left(0,\sigma_{MLE}^2\right). \]

implying that \[ Z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\sim (N(0,1) \] then \[ Pr\left(Z_{\alpha/2}\leq\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\leq Z_{1-\alpha/2}\right)=\alpha \] and the \((1-\alpha)\%\) Confidence Interval is defined as \[ \bar{x}\pm Z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}. \] It is very important to note that the statement of probability that is the basis for the confidence interval is not a statement of probability regarding \(\mu\), as \(\mu\) is not a random variable.

“Some aspects of conditional and asymptotic inference”. Sankhyā 50(3), 1988, pg 314.↩︎

Inverting the Test Statistic

There is an inherent connection between confidence intervals and hypothesis testing. Both of these procedures rely on the sampling distribution of an estimator \(\hat{\theta}\) for making inference about \(\theta\) the parameter of interest. We will also see that the two procedures are, in a sense, complementary means of making an inference.

Specifically, for a given point null hypothesis test with Type I error rate of \(\alpha\) and the hypotheses \[ H_0:\theta = \theta_0\qquad H_A:\theta\neq\theta_0 \] a rejection region \(R\) is defined based on the sampling distribution of the estimator \(\hat{\theta}\), the rejection region is defined as the disjoint region of \(\theta\in\Theta\) that is far enough away from \(\theta_0\) that the probability of observing the test statistic \(\hat{\theta}\) given that \(\theta=\theta_0\). The confidence interval \(S\) is the complement of the rejection region \(R\); in other words, it is a continuous region containing \(\bar{x}\) computed assuming that \(\theta=\theta_0\) and the decision to reject the null hypothesis occurs when \(\theta_0\notin S\).

We can conceptualise The \(1-\alpha\%\) confidence interval as the bounds of the compliment to the rejection region \(R\) for the hypothesis test with Type I error rate of \(\alpha\), i.e. we would reject the null hypothesis if the confidence interval did not contain \(0\).

Wilk’s theorem (for confidence intervals)

Approximate log-likelhood intervals are based on the Chi-squared distribution we used for hypothesis testing (Wilk’s theorem). For large samples \(n\), the log-likelihood ratio has an approximate Chi-squared distribution \[2 \left( \ell(\hat{\boldsymbol{\theta}}) - \ell(\boldsymbol{\theta}) \right) \sim \chi^{2}_{p} \] where \(p\) is the number of parameters estimated. Often it is neccessary to use an taylor approximation on \(2 \left( \ell(\hat{\boldsymbol{\theta}}) - \ell(\boldsymbol{\theta}) \right)\) because it is hard to invert. This leads to the approximate distribution \[ \left( \hat{\boldsymbol{\theta}} - \boldsymbol{\theta} \right)^{\top}\mathcal{J} \left( \hat{\boldsymbol{\theta}} - \boldsymbol{\theta} \right) \sim \chi^{2}_{p}\] where \(\mathcal{J}\) is the observed information matrix.

Pivotal Quantities

The test statistic \(Z\sim N(0,1)\) \[ Z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}. \] has a probability density function that does not depend on the parameters of the probability density function of the random variable \(X\). The quantity \(Z\) is called a pivotal quantity, a pivotal quantity is defined \(Q(\mathbf{x},\theta)\) is a pivotal quantity of \(\theta\) if the probability distribution of \(Q\) does not depend on \(\theta\) the parameters of the probability distribution of \(X\).

While the function \(Q(\mathbf{x},\theta)\) will in most cases explicitly contain both parameters and statistics, for any set \(A\), \(Pr(Q(\mathbf{x},\theta)\in A)\) cannot depend on \(\theta\). The technique of using pivotal quantities to construct confidence intervals relies on the ability to find the set \(\{\theta:Q(\mathbf{x},\theta)\in A\}\) such that \(A\in\Theta\) where \(\Theta\) is the domain of \(\theta\).

In the case of location-scale probability distributions (i.e. pdfs that are parameterised in terms of location and scale) there are typically several pivotal quantities available. If we consider the sample \(x_1,\ldots,x_n\) from the pdfs in the table below, and the sample mean \(\bar{x}\) and sample standard deviation \(s\) the resulting pivotal quantities are available.

Form of pdf	Type of pdf	Pivotal quantity
\(f(x-\mu)\)	Location	\(\bar{x}-\mu\)
\(\frac{1}{\sigma}f\left(\frac{x}{\sigma}\right)\)	Scale	\(\frac{\bar{x}}{\sigma}\)
\(\frac{1}{\sigma}f\left(\frac{x-\mu}{\sigma}\right)\)	Location-scale	\(\frac{\bar{x}-\mu}{\sigma}\)

Minimum Width Confidence Intervals

Suppose we conceive confidence intervals as the complement of hypothesis tests, or specifically as the complement to the rejection region. In that case, as we compare different methods of constructing confidence intervals, it makes sense that the measure of a test, its power, has a complement for confidence intervals. Power is a measure of a test’s ability to detect the difference between the true parameter value and its hypothesised value. For a given Type I error rate \(\alpha\), the best or most powerful test is the one with the largest rejection region; conversely, the “best confidence interval for a given confidence level \(1-\alpha\) is the narrowest. As we have seen in the previous examples, finding the minimum width confidence interval is not always straightforward.

The following theorem provides guidance for finding the minimum width confidence interval.

Finding the Minimum Width Confidence Interval:

Let \(f(x)\) be a unimodal pdf with \(f(x)>0\:\forall x\in X\). If the interval \([a,b]\) satisfies the conditions

\(\int_a^bf(x)dx = 1-\alpha\)
\(f(a)=f(b)\)
\(a\leq x^*\leq b\mbox{ where }x^*\mbox{ is the mode of }f(x)\).

then \([a,b]\) is the shortest interval that satisfies the condition \(1\).

These conditions show that for symmetric distributions, like the Gaussian or Student’s-\(t\), we can show that the minimum width confidence interval is the \(\alpha/2\) and \(1-\alpha/2\) quantiles, which we can find analytically with little effort.

For asymmetric or skewed distributions finding the points \(a\) and \(b\) that satisfy the theorem’s conditions can be a challenge. In cases where the resulting interval is a non-linear function of \(a\) and \(b\), the theorem’s conditions will not apply, and attempting to do so (as we will see shortly) will not yield the desired results.

Theory questions

Question 1

Let \(y_1,y_2,\ldots,y_n\) be a random sample of observations from a population described by the Binomial probability model \[ p(y~\vert~\theta,k) = \left(\begin{array}{c}k\\y\end{array}\right)\theta^y (1-\theta)^{k-y}, \qquad y=0,1,2,\ldots,k, \] where \(k\) is known. The log-likelihood for such a model² is

\[\ell(\theta) = n \bar{y} \log \theta + n(k-\bar{y})\log (1-\theta) + C\] where \(C\) is a constant not depending on \(\theta\).

What is the expected (Fisher) information, \(\mathcal{I}_{n}(\theta)\), for the parameter \(\theta\)? How about \(\mathcal{I}(\theta)\)?
Using asymptotic normality of the MLE for \(\theta\), construct the 90% confidence interval. What is the 90% confidence interval if you independently observe 5 trials, 10 times each, and the total number of successes is 20?
Discuss if the normal approximation for the confidence interval is appropriate for the binomial model.

See previous worksheets for details.↩︎

Question 2

Let \(y_1,y_2,\ldots,y_n\) be a random sample from a population described by the Poisson distribution with mean \(\lambda\).

Determine the likelihood of the data.
What is the MLE for the model?
What is the expected Fisher information for the model?
Construct a 95% confidence interval for the MLE using the asymptotic normal approximation.

Solution:
a. The log-likelihood is \[ L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \] \[ = \frac{e^{-n\lambda}\lambda^{n \bar{y}}}{\prod_{i=1}^ny_i!} \]

b. The log-likelihood is \[ \ell(\lambda) = \log L(\lambda) \] \[ = -n\lambda + (n \bar{y})\log \lambda - \sum_{i=1}^n \log (y_i!) \] The first derivative (score function) is \[ \frac{\partial l}{\partial\lambda} = -n + \frac{n\bar{y}}{\lambda} \] Setting this function to zero and solving for \(\lambda\) gives the MLE \(\hat{\lambda} = \bar{y}\).

c. The second derivative of the log-likelihood is \[ \frac{\partial^2 l}{\partial\lambda^2} = -\frac{n\bar{y}}{\lambda^2} \] Noting that the expected value of a Poisson random variable \(y_{i}\) is \(\lambda\), the expected value of \(\bar{y}\) can be found by \(\text{E}(\bar{y}) = \frac{1}{n}\sum_{i=1}^n\text{E}(y_{i})= \frac{1}{n}\sum_{i=1}^n\lambda = \lambda\). We will use this in the Fisher information matrix derivation below. The Fisher information matrix is \[ \mathcal{I}_{n}(\lambda) = -\text{E}\left(\frac{\partial^2 \ell}{\partial\lambda^2}\right) \] \[ = \frac{n\text{E}(\bar{y})}{\lambda^2} \] \[ = \frac{n\lambda}{\lambda^2} \] \[ = \frac{n}{\lambda}. \] The single observation Fisher information is simply: \[ \mathcal{I}(\lambda) = \mathcal{I}_{n}(\lambda)/n = 1/\lambda \]

d. From the asymptoptic normality of the MLE we have \[ \sqrt{n}(\hat{\lambda} - \lambda) \sim N(0, \mathcal{I}(\hat{\lambda})^{-1}) \] in the limit as \(n\rightarrow \infty\). We can standardise this as \[ \sqrt{n}\mathcal{I}(\hat{\lambda})^{1/2}(\hat{\lambda} - \lambda) \sim N(0, 1) \] which can be written different ways (e.g. \(\mathcal{I}_{n}(\hat{\lambda})^{1/2}(\hat{\lambda} - \lambda) \sim N(0, 1)\)). The 95% confidence interval bounds for the standard normal, \(N(0,1)\), are approximately \([-1.96,1.96]\). So our confidence interval is \[ -1.96 < \sqrt{n}\mathcal{I}(\hat{\lambda})^{1/2}(\hat{\lambda} - \lambda) < 1.96 \] \[ -1.96 < \frac{\sqrt{n}}{\sqrt{\hat{\lambda}}}(\hat{\lambda} - \lambda) < 1.96 \] \[ \frac{-1.96\sqrt{\hat{\lambda}}}{\sqrt{n}} < \hat{\lambda} - \lambda < \frac{1.96\sqrt{\hat{\lambda}}}{\sqrt{n}} \] \[ \frac{-1.96\sqrt{\hat{\lambda}}}{\sqrt{n}}-\hat{\lambda} < - \lambda < \frac{1.96\sqrt{\hat{\lambda}}}{\sqrt{n}}-\hat{\lambda} \] \[ \frac{-1.96\sqrt{\hat{\lambda}}}{\sqrt{n}}+\hat{\lambda} < \lambda < \frac{1.96\sqrt{\hat{\lambda}}}{\sqrt{n}}+\hat{\lambda} \] or in other words the 95% confidence interval for \(\lambda\) is \[ \hat{\lambda} \pm \frac{1.96\sqrt{\hat{\lambda}}}{\sqrt{n}}. \]

Question 3

Let \(y_1,y_2,\ldots,y_n\) be a random sample from a population described by the Poisson distribution with mean \(\lambda_1\). Let \(x_1,x_2,\ldots,x_m\) be a random sample from another population described by the Poisson distribution with mean \(\lambda_2\). Suppose we are interested in the parameter \(\tau=\lambda_2/\lambda_1\).

Write down the combined log-likelihood from both samples in terms of the parameters \(\lambda=\lambda_1, \tau= \lambda_2/\lambda_1\).
Determine the maximum likelihood estimates of the parameters \(\tau\) and \(\lambda\)³.
What is the observed information matrix with respect to \(\hat{\lambda},\hat{\tau}\)?
Hence determine the quadratic form of an approximate 90% confidence interval for the parameters \(\lambda,\tau\).
Use part (d) to devise an approximate hypothesis test for null hypothesis \(H_{0}: \lambda_{1} = \lambda_{2} = 2\) and alternative \(H_{1}: \lambda_{1} \neq 2 \quad \operatorname{or} \quad \lambda_{2} \neq 2\) with \(\alpha = 0.1\). State the rejection region.

Solution:
a. The log-likelihood can be found by combining the likelihoods for the \(x_{i}\) and \(y_{i}\) data \[ L(\lambda,\tau) = \left[\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \right]\left[\prod_{j=1}^m \frac{e^{-\tau\lambda}(\tau\lambda)^{x_j}}{x_j!}\right] \] \[ =\frac{e^{-\lambda(n+\tau m)}\lambda^{\sum_{i=1}^n y_i+\sum_{j=1}^m x_j}\tau^{\sum_{j=1}^m x_j}}{\prod_{i=1}^n y_i! \prod_{j=1}^m x_j!} \] \[ l(\lambda,\tau) = -\lambda(n+\tau m)+\left(\sum_{i=1}^n y_i+\sum_{j=1}^m x_j\right)\log \lambda +(\sum_{j=1}^m x_j)\log \tau + \text{const} \] \[ = -\lambda(n+\tau m)+\left(n\bar{y}+m\bar{x}\right)\log \lambda +(m\bar{x})\log \tau + \text{const} \]

b. From the derivatives (or using the fact that \(\hat\lambda_1=\bar y\) and \(\hat\lambda_2=\bar x\)) we can show that \(\hat{\lambda} = \bar y\) and \(\hat{\tau} = \frac{\bar x}{\bar y}\). The derivatives/score functions are: \[ \frac{\partial l(\lambda,\tau)}{\partial \lambda} = -(n+\tau m)+\frac{\left(n\bar{y}+m\bar{x}\right)}{\lambda} \] \[ \frac{\partial l(\lambda,\tau)}{\partial \tau} = -\lambda m +\frac{m\bar{x}}{\tau} \]

c. The second derivatives are \[ \frac{\partial^{2} l(\lambda,\tau)}{\partial \lambda^{2}} = - \frac{\left(n\bar{y}+m\bar{x}\right)}{\lambda^{2}} \] \[ \frac{\partial^{2} l(\lambda,\tau)}{\partial \tau^{2}} = -\frac{m\bar{x}}{\tau^{2}} \] \[ \frac{\partial^{2} l(\lambda,\tau)}{\partial \lambda \partial \tau} = -m \] therefore the observed information matrix is \[ \mathcal{J} = \begin{bmatrix} \frac{\left(n\bar{y}+m\bar{x}\right)}{\hat{\lambda}^{2}} & m\\\ m & \frac{m\bar{x}}{\hat{\tau}^{2}} \end{bmatrix} \]

d. Using the Wilk's theorem (quadratic approximation), we can form a confidence region using the Chi-square distribution (approximation) \[ \left[ \begin{bmatrix} \hat{\lambda} \\\ \hat{\tau} \end{bmatrix} - \begin{bmatrix} \lambda \\\ \tau \end{bmatrix} \right]^{\top} \begin{bmatrix} \frac{\left(n\bar{y}+m\bar{x}\right)}{\hat{\lambda}^{2}} & m\\\ m & \frac{m\bar{x}}{\hat{\tau}^{2}} \end{bmatrix} \left[ \begin{bmatrix} \hat{\lambda} \\\ \hat{\tau} \end{bmatrix} - \begin{bmatrix} \lambda \\\ \tau \end{bmatrix} \right] \leq 4.60517 \] since `qchisq(p = 0.9, df = 2)` \(= 4.60517\).

e. The given null hypothesis is equivalent to \(\lambda = 2\) and \(\tau = 1\), also the critical level (\(\alpha\)) is equivalent to a 90% confidence level. Therefore, the rejection region can be inferred from part (d), that is \[ \left[ \begin{bmatrix} \hat{\lambda} \\\ \hat{\tau} \end{bmatrix} - \begin{bmatrix} 2 \\\ 1 \end{bmatrix} \right]^{\top} \begin{bmatrix} \frac{\left(n\bar{y}+m\bar{x}\right)}{\hat{\lambda}^{2}} & m\\\ m & \frac{m\bar{x}}{\hat{\tau}^{2}} \end{bmatrix} \left[ \begin{bmatrix} \hat{\lambda} \\\ \hat{\tau} \end{bmatrix} - \begin{bmatrix} 2 \\\ 1 \end{bmatrix} \right] > 4.60517 \]

Hint: Use the invariance property of maximum likelihood estimates↩︎

Practical questions

Question 4

Using Q3(d) we will create a contour plot for the 90% confidence interval. Assuming \(\hat{\lambda}_{1} = 5\), \(\hat{\lambda}_{2} = 10\), while \(n = 20\) and \(m = 10\).

Create a function with input \(\lambda,\tau\) that evaluates the left hand side of the confidence interval you derived.
Generate a grid of points for \(\lambda,\tau\), then evaluate these points with the function in (a). Use the crossing function from package tidyr. An example of plotting a contour is given below.⁴

Example of a contour plot


      library(tidyr); library(dplyr); library(ggplot2);
      # Change "to" and "from" values
      tau <- seq(from = -1, to = 1,length.out = 10) 
      lambda <- seq(from = -1, to = 1, length.out = 10)
      plot_data <- crossing(tau, lambda)
      
      plot_data <- plot_data %>%
        mutate(CI = tau^2 + lambda^2)
      # The CI variable draws a circle 
      ggplot() + 
        geom_contour(data = plot_data, 
                     aes(x = tau, y = lambda, z = CI, 
                         colour = factor(..level..)), 
                     breaks = c(0.5,1)
                     ) +
        scale_color_discrete("CI quantile value") +
        theme_bw()

Example of contour plot.

Hint: you may have to try several grids to get the correct confidence interval.↩︎


# (a) Construct the function to compute the values of the confidence region


confidence_region_value <- function(lambda, tau){
 
  # mles
  lambda_hat <- 5
  tau_hat <- 0.5
  
  lambda_2_hat <- lambda_hat / tau_hat
  
  n <- 20
  m <- 10
  
  # obs information matrix
  obs_info <- matrix(0, ncol=2, nrow = 2)
  obs_info[1,1] <- (n * lambda_hat + m * lambda_2_hat)/(lambda_hat^2)
  obs_info[1,2] <- m
  obs_info[2,1] <- m
  obs_info[2,2] <- lambda_2_hat/(tau_hat^2)
  
  vec_val <- matrix(c(lambda_hat - lambda, tau_hat - tau), ncol = 1)
  
  conf_value <- as.numeric( t(vec_val) %*% obs_info %*% vec_val )
  
  return(conf_value)
  
}


# (b)  Plot the results


CI_quant_vals <- round(qchisq(p = c(0.8, 0.9, 0.95), df = 2),2)
 # quantiles for 1 - alpha values 

conf_region <- Vectorize(confidence_region_value)

 tau <- seq(from = 0, to = 1,length.out = 100) 
  lambda <- seq(from = 3, to = 7, length.out = 100)
  plot_data <- crossing(tau, lambda)
  
  plot_data <- plot_data %>%
    mutate(CI = conf_region(lambda = lambda, tau = tau))
  # The CI variable draws a circle 
  ggplot() + 
    geom_contour(data = plot_data, 
                 aes(x = tau, y = lambda, z = CI, 
                     colour = factor(..level..)), 
                 breaks = CI_quant_vals
                 ) +
    scale_color_discrete("CI quantile value") +
    theme_bw()

Week 5: Interval Estimation