MXB341 Statistical Inference
Informed decision-making comes from a long tradition of guessing and then blaming others for inadequate results. — Scott Adams (Dilbert comic strip)
Week 8: Bayesian Inference
The distinguishing aspect of Bayesian statistics is the conceptualisation of the parameter as a random quantity with a probability distribution rather than a fixed quantity. Because uncertainty about the parameter is expressed as a probability distribution, we can make inference about it directly, rather than interpreting uncertainty in terms of the random sample via the sampling distribution. In particular, the posterior distribution allows direct probabilistic statements about the parameter, for example credible intervals, the Bayesian counterpart of confidence intervals. While this makes interpreting interval estimates more straightforward, it is more computationally demanding, as it requires choosing an appropriate prior and evaluating the posterior distribution. In some cases there is no direct analogue of classical or frequentist procedures (e.g. point-null hypothesis testing). Lastly, it is important to recognise and understand the influence of the parameter’s prior distribution on the results of analyses.
We will therefore examine three approaches to inference in the Bayesian paradigm: interval estimation, hypothesis testing, and decision theory. Each has an analogue among classical techniques, and we will draw out these contrasts and comparisons as we explore each concept.
Bayesian Inference
Credible Intervals
Interval estimation in the Bayesian context uses the posterior distribution to directly compute an interval based on a specified probability.
Given the posterior distribution \(\pi(\theta|\mathbf{x})\), the \(1-\alpha\) credible set is \[C(\theta)=\left\{\theta:Pr\left(\theta\in C(\theta)\right)=1-\alpha\right\}.\] For univariate cases this can be defined by an interval \([a,b]\) satisfying \[ \int_a^b\pi(\theta|\mathbf{x})\,d\theta=1-\alpha, \] so that \[ C(\theta)=\{\theta:\theta\in[a,b]\}. \] As with confidence intervals, there are limited analytical tools available to minimise \(b-a\) for a given \(1-\alpha\); in many instances numerical techniques are necessary.
A set \(C^*(\theta)=[a,b]\) that minimises \(b-a\) for a given \(1-\alpha\), \[ C^*(\theta)=\underset{[a,b]}{\arg\min}\;(b-a)\quad\text{subject to}\quad\int_{a}^b\pi(\theta|\mathbf{x})\,d\theta=1-\alpha, \] is the highest posterior density (HPD) credible interval.
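As a concrete illustration (our own sketch, not one of the examples below), the HPD interval for a unimodal posterior can be found numerically by searching over the lower tail probability that minimises the interval width. The posterior here is a hypothetical Gamma(5, 2) and the function name hpd_gamma is an arbitrary choice.
# Minimal sketch: numerically locate the 95% HPD interval for a hypothetical
# Gamma(5, 2) posterior by minimising the width b - a over the lower tail probability.
hpd_gamma <- function(shape, rate, level = 0.95){
  width <- function(p_low){
    qgamma(p_low + level, shape = shape, rate = rate) -
      qgamma(p_low, shape = shape, rate = rate)
  }
  opt <- optimise(width, interval = c(0, 1 - level))
  c(lower = qgamma(opt$minimum, shape = shape, rate = rate),
    upper = qgamma(opt$minimum + level, shape = shape, rate = rate))
}
hpd_gamma(shape = 5, rate = 2)
# compare with the equal-tailed interval:
qgamma(c(0.025, 0.975), shape = 5, rate = 2)
For a symmetric, unimodal posterior the HPD and equal-tailed intervals coincide; for skewed posteriors the HPD interval is shorter.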
Hypothesis Testing
For hypotheses of the form \[ H_0:\theta\in \Theta_0\quad\mbox{vs}\quad H_A:\theta\in \Theta_0^c \] the posterior probabilities are equivalent to testing the hypotheses \[ \begin{aligned} Pr(\theta\in\Theta_0|\mathbf{y})&\equiv Pr(H_0\text{ is true}|\mathbf{y})\\ Pr(\theta\in\Theta_0^c|\mathbf{y})&\equiv Pr(H_0\text{ is false}|\mathbf{y}). \end{aligned} \]
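For example (an illustration of the equivalence above, not one of the questions below), if the posterior for a proportion \(p\) were Beta(8, 4), the posterior probabilities of \(H_0: p \le 0.5\) and \(H_A: p > 0.5\) are simply tail areas of that posterior:
# illustrative posterior probabilities for H0: p <= 0.5 vs HA: p > 0.5,
# assuming a hypothetical Beta(8, 4) posterior for p
pr_H0 <- pbeta(0.5, shape1 = 8, shape2 = 4)                      # Pr(p <= 0.5 | y)
pr_HA <- pbeta(0.5, shape1 = 8, shape2 = 4, lower.tail = FALSE)  # Pr(p > 0.5  | y)
c(H0 = pr_H0, HA = pr_HA)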
But in the case of the point-null hypothesis \[ H_0:\theta=\theta_0\quad\mbox{vs}\quad H_A:\theta\neq \theta_0 \] we have \(Pr(\theta=\theta_0|\mathbf{y})=0\) whenever the posterior is continuous, which prevents meaningful hypothesis testing. (The exception is when the parameter space is discrete, but this is a rare occurrence.)
Bayes Factors
Bayes factors are one alternative to classical hypothesis testing. If there are two competing models, \(M_{1}\) and \(M_{2}\), we compare the marginal likelihoods of the hypotheses they represent. The marginal likelihood is the probability of observing the data under a given model, i.e. \(P(\text{data}) = P(\boldsymbol{x})\) (this is the normalising constant in Bayes' rule). Since we have two different models we add the conditioning: \[ Pr(\boldsymbol{x}\vert M_{i}) = \int_{\theta_{i}} f(\boldsymbol{x}\vert\theta_{i}, M_{i})\pi(\theta_{i}\vert M_{i}) \text{ d}\theta_{i}, \] which can also be written as \[ Pr(\boldsymbol{x}\vert M_{i}) =\frac{Pr(M_{i} \vert \boldsymbol{x})Pr(\boldsymbol{x})}{Pr(M_{i})} \] using Bayes' rule. The Bayes factor comparing these two models is \[ BF_{12} = \frac{Pr(\boldsymbol{x}~\vert~M_{1})}{Pr(\boldsymbol{x}\vert M_{2})} = \frac{Pr(M_{1} \vert \boldsymbol{x}) Pr(M_{2})}{Pr(M_{2}\vert\boldsymbol{x}) Pr(M_{1})}. \] The value of \(BF_{12}\) is interpreted against a reference table of evidence strength; more than one such scale is in common use (e.g. Jeffreys' scale).
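As a small illustration (our own example, separate from the questions below), consider comparing a fixed-probability coin model with a uniform-prior model after observing 7 heads in 10 tosses; both marginal likelihoods are available in closed form.
# illustrative Bayes factor: M1 fixes p = 0.5; M2 places a Beta(1, 1) prior on p
heads <- 7; tosses <- 10
evid_m1 <- dbinom(heads, size = tosses, prob = 0.5)  # Pr(x | M1)
evid_m2 <- 1 / (tosses + 1)                          # Pr(x | M2): Beta(1, 1)-Binomial marginal likelihood
BF_12 <- evid_m1 / evid_m2
BF_12  # a value near 1 indicates little evidence either way on common reference scales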
Bayesian Decision Theory
Every decision \(d\) made on the basis of data \(y\) incurs a potential loss, depending on the decision \(d(y)\) and the true state of reality \(\theta\): \[ \operatorname{Loss} = L (\theta, d(y)). \] Based on this loss, we can define the risk associated with each decision as:
\[ R(\theta, d(y)) = E_{\theta}[L(\theta, d(y))]. \] There are two approaches to using risk as a basis for decision making. The first is called minimax: \[ \underset{d}{\mathrm{min}}\left[\underset{\theta}{\mathrm{max}}R(\theta, d(y)) \right]. \] The Bayesian approach to decision making is to treat the risk as a function of \(\theta\), average it over the prior, and choose the decision that minimises the result, the Bayes risk: \[ \underset{d(y)}{\mathrm{min}} \int_{\Theta} R(\theta, d(y)) \pi(\theta)\, d\theta. \] If the posterior distribution is used in lieu of the prior, the Bayes risk becomes
\[ \int_{\Theta} R(\theta, d(y)) \pi(\theta | \mathbf{y})\, d\theta, \] the posterior expected loss, and the Bayes rule is to choose the action or decision that minimises it.
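To see how the loss function determines the Bayes rule, consider quadratic loss (a standard argument, included here as a pointer to Question 3). The posterior expected loss of an estimate \(\hat{\theta}\) is \[ \rho(\hat{\theta}) = \int_{\Theta} (\theta - \hat{\theta})^{2}\, \pi(\theta\,|\,\mathbf{y})\, d\theta, \] and setting \[ \frac{d\rho}{d\hat{\theta}} = -2\int_{\Theta}(\theta - \hat{\theta})\,\pi(\theta\,|\,\mathbf{y})\,d\theta = 0 \] gives \(\hat{\theta} = E(\theta\,|\,\mathbf{y})\), the posterior mean. An analogous argument shows that absolute loss leads to the posterior median; both estimators appear in Question 3 below.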
Theory questions
Question 1
Assume \(x_{1}, x_{2}, \ldots, x_{n}\) are exponentially distributed with rate \(\lambda\). Show that the gamma prior \(\lambda \sim \text{Gamma}(\alpha,\beta)\) is conjugate for the exponential distribution. What is the marginal likelihood of the data? Show that the posterior predictive distribution is a Lomax distribution.
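A simulation check (not a substitute for the derivation) can help confirm the Lomax result: draw \(\lambda\) from the Gamma posterior, then draw a new observation from the exponential likelihood, and compare against the Lomax density from extraDistr. The data and hyperparameter values below are arbitrary choices.
# posterior predictive check for the Exp likelihood with a Gamma(alpha0, beta0) prior
library(extraDistr)  # for dlomax()
set.seed(1)
alpha0 <- 2; beta0 <- 3                   # arbitrary prior hyperparameters
x <- rexp(15, rate = 1.5)                 # arbitrary simulated data
alpha_n <- alpha0 + length(x)             # conjugate Gamma posterior update
beta_n <- beta0 + sum(x)
lambda_draws <- rgamma(1e5, shape = alpha_n, rate = beta_n)
x_new <- rexp(1e5, rate = lambda_draws)   # draws from the posterior predictive
hist(x_new, breaks = 100, freq = FALSE, xlim = c(0, 3))
curve(dlomax(x, lambda = 1/beta_n, kappa = alpha_n), add = TRUE, col = "red")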
Question 2
Consider data from a Poisson process (i.e. realisations of a Poisson random variable \(X\)) where \[ Pr(X=x|\lambda)=\frac{\lambda^xe^{-\lambda} }{x!}. \] Assume a prior for \(\lambda\) \[ \pi(\lambda)=4\lambda e^{-2\lambda}. \] For the data \(\mathbf{x}=(x_1,x_2,\ldots,x_n)\):
Find the posterior distribution of \(\lambda|\mathbf{x}\)
Find the \(95\%\) credible interval \(C\) for \(\lambda\) (under the prior) and for \(\lambda|\mathbf{x}\) (under the posterior)
Find \[ Pr(\lambda>1)\text{ and } Pr(\lambda>1|\mathbf{x}) \]
Plot the prior and posterior of \(\lambda\) together and comment. How did the data affect the posterior compared to the prior? Repeat this for the prior \[ \pi(\lambda)=e^{-\lambda} \]
# (b)
library(tidyverse)  # provides %>%, tibble() and ggplot() used below
x <- c(2, 1, 4, 3, 4, 2, 1, 2, 0, 0)
sumx <- sum(x)
n <- length(x)
alpha <- 2 + sumx
beta <- 2 + n
## Compute the equal-tailed 95% credible interval
ll <- qgamma(0.025, shape = alpha, rate = beta) %>% round(2)
ul <- qgamma(0.975, shape = alpha, rate = beta) %>% round(2)
# (c)
x <- c(2, 1, 4, 3, 4, 2, 1, 2, 0, 0)
sumx <- sum(x)
n <- length(x)
alpha <- 2 + sumx
beta <- 2 + n
## Prior and posterior probabilities that lambda exceeds 1
prior_prob <- pgamma(1, shape = 2, rate = 2, lower.tail = FALSE) %>% round(2)
post_prob <- pgamma(1, shape = alpha, rate = beta, lower.tail = FALSE) %>% round(2)
# (d)
x <- c(2, 1, 4, 3, 4, 2, 1, 2, 0, 0)
sumx <- sum(x)
n <- length(x)
alpha <- 2 + sumx
beta <- 2 + n
lambda_grid <- seq(0, 5, length.out = 1000)
prior_2 <- dgamma(lambda_grid, shape = 2, rate = 2)  # prior pi(lambda) = 4*lambda*exp(-2*lambda), i.e. Gamma(2, 2)
prior_1 <- dgamma(lambda_grid, shape = 1, rate = 1)  # prior pi(lambda) = exp(-lambda), i.e. Gamma(1, 1)
posterior_2 <- dgamma(lambda_grid, shape = alpha, rate = beta)
posterior_1 <- dgamma(lambda_grid, shape = alpha - 1, rate = beta - 1)
df <- tibble(
  lambda = rep(lambda_grid, 4),
  density = c(prior_2, posterior_2, prior_1, posterior_1),
  names = c(rep("prior 2", 1000), rep("posterior 2", 1000),
            rep("prior 1", 1000), rep("posterior 1", 1000))
)
ggplot(df, aes(x = lambda, y = density, color = names)) +
  geom_line()
Question 3
Consider the absolute and quadratic loss functions, \(L_{1}(\beta, \hat{\beta}) = \vert\beta - \hat{\beta} \vert\) and \(L_{2}(\beta, \hat{\beta}) = (\beta - \hat{\beta})^{2}\), respectively.
We are going to perform a simulation to test certain estimators (posterior mean, median, and mode) for these loss functions in a Bayesian context. Suppose you have data \((y_{i} ~ \vert ~ \beta) \sim \text{Weibull}(\lambda, k = 0.4)\), where \(\beta = \lambda^{k}\) is the parameter of interest, and wish to use the prior \(\beta \sim \text{invGamma}(\alpha_{0} = 0.1,\beta_{0} = 2)\). Note, this is a conjugate prior, and the posterior distribution has parameters \(\alpha_{n} = \alpha_{0} + n\) and \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} y_{i}^{k}\).
Simulate some Weibull data with \(n = 20\) and the parameters \(\lambda = 1\) (scale), \(k = 0.4\) (shape).
Write functions (or a single function) that calculate the posterior distribution’s: (i) \(\alpha_{n}\) and \(\beta_{n}\), (ii) posterior mean, (iii) posterior mode, and (iv) posterior median, given some data points \(\boldsymbol{y}\). Make sure you test your code with the data in part (a).
Using (b), write a function to simulate from the posterior distribution of \(\beta\).
We can estimate the Bayesian risk for a given estimator by simulating from the posterior distribution and calculating an average. Write a function to calculate the average of each type of loss above using samples from the posterior. It should take a given estimate, \(\hat{\beta}\), together with samples from the posterior of \(\beta\), and return the average loss.
Use the above functions to evaluate the estimators from part (b) using the estimated Bayes risk from part (c). Also try using \(n=3\) and \(n=100\). Comment on the results.
# (a) here
n <- 20
# parameters for weibull data
weibull_scale <- 1 # what we are trying to estimate
weibull_shape <- 0.4 # fixed (equal to k)
# simulate some weibull data
y <- rweibull(n = n, shape = weibull_shape, scale = weibull_scale)
# (b) here
library(extraDistr)  # for qinvgamma() and rinvgamma()
library(dplyr)       # for %>% and tibble() used below
posterior_beta_weibull_invgamma <- function(y, alpha_0, beta_0, k){
# prior: lambda ~ invgamma(a,b)
# data model: x ~ weibull(lambda, k = weibull_shape)
# parameters for invgamma posterior:
alpha_n <- alpha_0 + length(y)
beta_n <- beta_0 + sum(y^k)
# bayes estimates of beta
post_beta_mean <- beta_n / (alpha_n - 1)
post_beta_mode <- beta_n / (alpha_n + 1)
post_beta_median <- qinvgamma(p = 0.5, alpha = alpha_n, beta = beta_n)
return(
list(
alpha_n = alpha_n,
beta_n = beta_n,
mean = post_beta_mean,
mode = post_beta_mode,
median = post_beta_median
)
)
}
# test code
# parameters for prior
prior_invgamma_alpha <- 0.1
prior_invgamma_beta <- 2
posterior_results <-
posterior_beta_weibull_invgamma(y = y,
alpha_0 = prior_invgamma_alpha,
beta_0 = prior_invgamma_beta,
k = weibull_shape)
# access one element of list:
posterior_results$alpha_n
# one way to view all...
posterior_results %>% c(recursive = TRUE)
# (c) here
# calculating an average loss from the posterior is a
# Monte Carlo estimate of the Bayesian risk
loss_absolute_avg <- function(beta_hat, beta_posterior_samples){
mean( abs(beta_hat - beta_posterior_samples) )
}
loss_quadratic_avg <- function(beta_hat, beta_posterior_samples){
mean( (beta_hat - beta_posterior_samples)^2 )
}
# (d) here
# compare the estimators using the estimated
# Bayesian risk (average of loss functions)
# one simulation in a study would look like this (repeat with n = 3, 20, and 100):
# generate data
y <- rweibull(n = 3, shape = weibull_shape, scale = weibull_scale)
# calc posterior values
posterior_vals <-
posterior_beta_weibull_invgamma(y = y,
alpha_0 = prior_invgamma_alpha,
beta_0 = prior_invgamma_beta,
k = weibull_shape)
beta_posterior_samples <-
rinvgamma(n = 10000,
alpha = posterior_vals$alpha_n,
beta = posterior_vals$beta_n)
# there are better ways of doing this using the purrr package.
bayes_risk <- tibble(
loss_function = c("abs", "abs", "abs", "quad", "quad", "quad"),
estimator = c("mean", "mode", "median", "mean", "mode", "median"),
value = c(loss_absolute_avg(posterior_vals$mean, beta_posterior_samples),
loss_absolute_avg(posterior_vals$mode, beta_posterior_samples),
loss_absolute_avg(posterior_vals$median, beta_posterior_samples),
loss_quadratic_avg(posterior_vals$mean, beta_posterior_samples),
loss_quadratic_avg(posterior_vals$mode, beta_posterior_samples),
loss_quadratic_avg(posterior_vals$median, beta_posterior_samples)
)
)
bayes_risk
Practical questions
Question 4
In this question you will use Q1 and Q3 (theory) in order to choose between a log-normal distribution with conjugate prior and an exponential distribution with a Gamma prior, using a Bayes factor.
Figure 1: Old girl Coco.
Note the following about a Bayesian model for log-normal data with a Normal-Gamma prior: The model is defined by \[ \begin{aligned} (x_{i}~\vert~\mu,\tau) &\sim \text{LogN}(\mu, \tau^{-1}) \quad \text{for } i \in \{1,2,\ldots,n\}\\ (\mu~\vert~\tau, \mu_{0}) &\sim \text{Normal}(\mu_{0}, \tau^{-1}) \\ (\tau~\vert~\alpha,\beta) &\sim \text{Gamma}(\alpha,\beta) \end{aligned} \] and has posterior distribution with hierarchy \[ \begin{aligned} (\mu~\vert~\boldsymbol{x},\tau, \mu_{0}) &\sim \text{Normal}(\mu^{\prime}, [(n+1)\tau]^{-1}) \\ (\tau~\vert~\boldsymbol{x},\alpha,\beta) &\sim \text{Gamma}(\alpha^{\prime},\beta^{\prime})\\ \mu^{\prime} &= \frac{\mu_{0} + ns_{1}}{n+1}, \quad s_{1} = n^{-1}\sum_{i=1}^{n}\log x_{i} \\ \alpha^{\prime} & = \alpha + n/2\\ \beta^{\prime} & = \beta + \frac{n}{2} \left(s_{2} + \frac{(s_{1} - \mu_{0})^{2}}{n+1} \right), \quad s_{2} = n^{-1}\sum_{i=1}^{n}(\log x_{i} - s_{1})^{2}. \end{aligned} \] The marginal likelihood of the data is \[ p(\boldsymbol{x}~\vert~\mu_{0}, \alpha^{\prime}, \beta^{\prime}) = (2\pi)^{-n/2}\frac{\Gamma(\alpha^{\prime}) \beta^{\alpha}}{\Gamma(\alpha) \beta^{\prime \alpha^{\prime}}} (n+1)^{-1/2} \] and the posterior predictive distribution is a Generalised Student’s t-distribution on the log-scale: \[ \begin{aligned} \tilde{x} & = \exp(\tilde{y})\\ (\tilde{y}~\vert~\boldsymbol{x}, \mu_{0},\alpha,\beta) &\sim \text{t}_{2\alpha^{\prime}}\left(\mu^{\prime}, \frac{\beta^{\prime}(n+2)}{\alpha^{\prime}(n+1)}\right). \end{aligned} \]
Coco (Figure 1) is an old dog who no longer eats as much food. Her human, Daemo, would like to determine how much he should be feeding Coco each day. Daemo decides that he would like us to model the feeding statistically, and lucky for us he is a Bayesian! Let \(x_{i}\) be the amount, in kilograms, that Coco eats on day \(i\), for \(i \in \{1,2,3,\ldots\}\). He thinks that there are two models which could describe how much Coco chooses to eat: \[ \begin{aligned} &M_{1}: (x_{i}~\vert~\lambda) \sim \text{Exp}(\lambda), \quad \lambda \sim \text{Gamma}(4,1)\\ &M_{2}: (x_{i}~\vert~\mu, \tau) \sim \text{logN}(\mu, \tau^{-1}), \quad (\mu ~\vert~ \tau) \sim \text{N}(-1,\tau^{-1}), \quad \tau \sim \text{Gamma}(10,1) \end{aligned} \]
Write a function that calculates the median of the posterior predictive distribution for \(M_{1}\) and \(M_{2}\) respectively. It should take a vector of observations as input.
Write a function that calculates the marginal likelihood for \(M_{1}\) and \(M_{2}\) respectively. Use this to write a function for the Bayes factor comparing the two models. These functions should also take a vector of data as input.
Suppose you speak to Daemo each evening and he reports to you how much food Coco has eaten. Over the course of a week, he reports Coco eats \(x_{1} = 1, x_{2} = 0.8, x_{3} = 0.4, x_{4} = 0.35, x_{5} = 0.45, x_{6} = 0.3, x_{7} = 0.9\) kilograms of food. Show Daemo how the
- Median of the posterior predictive of each model, and
- Bayes factor
get updated by Bayes rule after each evening report.
Additionally, for each evening report
- Use the median of the predictive distributions to estimate the next observation. Record the MSE as each new observation arrives.
- After each update, make a recommendation to Daemo on whether there is enough evidence to recommend \(M_{1}\) or \(M_{2}\).
# (a) here
library(extraDistr)
library(dplyr)
# M1: Exp + Gamma
m1_postr_predict_median <- function(x, alpha, beta){
  alpha_p <- alpha + length(x)
  beta_p <- beta + sum(x)
  # median of the Lomax posterior predictive (0.5 quantile):
  qlomax(0.5, lambda = 1/beta_p, kappa = alpha_p)
}
m2_postr_predict_median <- function(x, mu0, alpha, beta){
s1 <- mean( log(x) )
s2 <- mean( (log(x) - s1)^2 )
n <- length(x)
mu_prime <- (n * s1 + mu0) / (n + 1)
alpha_prime <- alpha + (n/2)
beta_prime <- beta +
(n/2) * ( s2 + ((s1 - mu0)^2)/(n + 1) )
# params of t dist
t_mean <- mu_prime
t_scale <- ( beta_prime * (n + 2) ) / (alpha_prime * (n + 1))
t_df <- 2 * alpha_prime
# median of log t dist
# P(L < t) = 0.5, L ~ log student-t
# P(log(T) < t) = P(T < exp(t)) = 0.5, T ~ student-t
exp(
qlst(p = 0.5, df = t_df,
mu = t_mean, sigma = t_scale)
)
}
x <- c(1, 0.8, 0.4, 0.35, 0.45, 0.3, 0.9)
m1_alpha <- 4
m1_beta <- 1
m1_postr_predict_median(x, m1_alpha, m1_beta)
m2_mu0 <- -1
m2_alpha <- 10
m2_beta <- 1
m2_postr_predict_median(x, m2_mu0, m2_alpha, m2_beta)
# (b) here
m1_marginal_evidence <- function(x, alpha, beta){
n <- length(x)
alpha_p <- alpha + n
beta_p <- beta + sum(x)
log_evid <- alpha * log(beta) + lgamma(alpha_p) -
alpha_p * log(beta_p) - lgamma(alpha)
exp(log_evid)
}
m2_marginal_evidence <- function(x, mu0, alpha, beta){
s1 <- mean( log(x) )
s2 <- mean( (log(x) - s1)^2 )
n <- length(x)
mu_prime <- (n * s1 + mu0) / (n + 1)
alpha_prime <- alpha + (n/2)
beta_prime <- beta +
(n/2) * ( s2 + ((s1 - mu0)^2)/(n + 1) )
log_evid <- -(n/2) * log(2 * pi) +
alpha * log(beta) + lgamma(alpha_prime) -
alpha_prime * log(beta_prime) - lgamma(alpha) -
(1/2) * log(n + 1)
exp(log_evid)
}
m1_evid <- m1_marginal_evidence(x, m1_alpha, m1_beta)
m2_evid <- m2_marginal_evidence(x, m2_mu0, m2_alpha, m2_beta)
# Bayes factor as a function:
m1_m2_bf <- function(x, m1_alpha, m1_beta, m2_mu0, m2_alpha, m2_beta){
m1_marginal_evidence(x, m1_alpha, m1_beta) /
m2_marginal_evidence(x, m2_mu0, m2_alpha, m2_beta)
}
# substantial evidence that M1 is a better model (after full week).
m1_m2_bf(x, m1_alpha, m1_beta, m2_mu0, m2_alpha, m2_beta)
# (c) here
x_each_day <- lapply(1:length(x), function(i) x[1:i])
x_each_day[[1]]
x_each_day[[3]]
# rolling update of posterior predictive median:
# m1
m1_pp_med <- sapply(x_each_day, m1_postr_predict_median,
alpha = m1_alpha, beta = m1_beta)
# m2
m2_pp_med <- sapply(x_each_day, m2_postr_predict_median,
mu0 = m2_mu0, alpha = m2_alpha, beta = m2_beta)
# rolling update of bayes factor:
bayes_factor <- sapply(x_each_day, m1_m2_bf,
m1_alpha = m1_alpha, m1_beta = m1_beta,
m2_mu0 = m2_mu0, m2_alpha = m2_alpha, m2_beta = m2_beta)
bayes_factor
# (d) here
# cumulative mean squared error of one-step-ahead predictions:
# the posterior predictive median after day i is used to predict day i + 1
# model 1
pp_mse1 <- cummean( (head(m1_pp_med, -1) - x[-1])^2 )
# model 2
pp_mse2 <- cummean( (head(m2_pp_med, -1) - x[-1])^2 )
pp_mse1 < pp_mse2
Note: the R package extraDistr has functions for the Lomax and location-scale Student's t distributions.