MXB341 Statistical Inference
“The impact of Gibbs sampling and MCMC was to change our entire method of thinking and attacking problems.”
— Robert Short
“These days the statistician is often asked such questions as ‘Are you a Bayesian?’ ‘Are you a frequentist?’ ‘Are you a data analyst?’ ‘Are you a designer of experiments?’. I will argue that the appropriate answer to ALL of these questions can be (and preferably should be) ‘yes’, and that we can see why this is so if we consider the scientific context for what statisticians do.”
— George E. P. Box
Week 11: Bayesian Computation
Bayesian Computation
Thus far we have seen examples of univariate sampling cases, and we have seen how we can use the hit-and-run sampler to draw samples from a multivariate posterior. In practice, most models used for data analysis have more than one parameter that requires inference; indeed, more sophisticated models may include random effects with a hierarchical structure, resulting in a high-dimensional parameter space. These models call for a more robust approach to drawing samples from the joint posterior.
Gibbs Sampling
In general, when implementing Bayesian methods the preferred approach is to:
- Find closed form solutions to the joint posterior.
- Use Monte Carlo integration, drawing independent samples directly from the joint posterior.
- Implement a Markov chain Monte Carlo (MCMC) scheme to draw samples from the joint posterior.
The first case rarely occurs in practice, and usually only for trivial models. The second case is highly desirable, but often not practical or possible. The third case is the most common situation and typically requires implementing a Gibbs sampling scheme. Gibbs sampling arises both as a special case of the Metropolis-Hastings algorithm and from the limiting properties of Markov chains. We can see the motivation for the Gibbs sampler by considering a model with a Gaussian likelihood. When both the mean and the variance are unknown we need to sample from the joint posterior \[ \pi(\mu,\sigma^2|\mathbf{y})=\frac{f(\mathbf{y}|\mu,\sigma^2)\pi(\mu)\pi(\sigma^2)}{\int\int f(\mathbf{y}|\mu,\sigma^2)\pi(\mu)\pi(\sigma^2)\,d\mu\, d\sigma^2} \] which does not have a nice closed form. This motivates the Gibbs sampling scheme:
Given the conditional posteriors \[ \begin{align} \pi(\mu|\sigma^2,\mathbf{y})&\propto f(\mathbf{y}|\mu,\sigma^2)\pi(\mu)\\ \pi(\sigma^2|\mu,\mathbf{y})&\propto f(\mathbf{y}|\mu,\sigma^2)\pi(\sigma^2) \end{align} \] set initial values \(\mu_{(0)}\) and \(\sigma^2_{(0)}\), and for \(i=1,\ldots,N\):
1. Draw \(\sigma^2_{(i)}\) from \(\pi(\sigma^2|\mu_{(i-1)},\mathbf{y})\).
2. Draw \(\mu_{(i)}\) from \(\pi(\mu|\sigma^2_{(i)},\mathbf{y})\).
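With conjugate priors both conditional draws are available in closed form. Below is a minimal R sketch of this two-step scheme, assuming (purely for illustration) the priors \(\mu\sim N(m_0,s_0^2)\) and \(\sigma^2\sim \text{InvGamma}(a,b)\); the data and hyperparameter values are not from the notes.
# two-step Gibbs sketch for a Normal model with unknown mean and variance,
# assuming priors mu ~ N(m0, s0^2) and sigma^2 ~ InvGamma(a, b) (illustrative)
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)      # simulated data
n <- length(y)
m0 <- 0; s02 <- 100; a <- 2; b <- 1     # illustrative hyperparameters
N <- 5000
mu <- numeric(N); sigma2 <- numeric(N)
mu[1] <- mean(y); sigma2[1] <- var(y)
for(i in 2:N){
  # sigma^2 | mu, y ~ InvGamma(a + n/2, b + sum((y - mu)^2)/2)
  sigma2[i] <- 1 / rgamma(1, shape = a + n/2, rate = b + sum((y - mu[i-1])^2)/2)
  # mu | sigma^2, y ~ N(posterior mean, posterior variance)
  post_var <- 1 / (n/sigma2[i] + 1/s02)
  post_mean <- post_var * (sum(y)/sigma2[i] + m0/s02)
  mu[i] <- rnorm(1, mean = post_mean, sd = sqrt(post_var))
}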
The Gibbs Sampler
Gibbs sampling is a surprisingly simple and robust method for drawing samples from a target posterior distribution. The basic steps are:
Given a target distribution \(g(\boldsymbol{\theta})\) where \(\boldsymbol{\theta}=\theta_1,\ldots,\theta_p\), we can draw \(T\) samples from \(g\) as follows:
- Derive the set of full conditionals \[ \begin{align} g(\theta_1|\boldsymbol{\theta}_{-1})\\ g(\theta_2|\boldsymbol{\theta}_{-2})\\ \vdots\\ g(\theta_p|\boldsymbol{\theta}_{-p}) \end{align} \]
- Set the initial values for \(\boldsymbol{\theta}^{(0)}\)
- For \(t=1,\ldots,T\) draw samples as follows \[ \begin{align} \theta^{(t)}_1|&\theta_{2,\ldots,p}^{(t-1)}\\ \theta^{(t)}_2|&\theta_1^{(t)},\theta_{3,\ldots,p}^{(t-1)}\\ \theta^{(t)}_3|&\theta_{1,2}^{(t)},\theta_{4,\ldots,p}^{(t-1)}\\ \vdots&\\ \theta_p^{(t)}|&\theta_{1,\ldots,p-1}^{(t)} \end{align} \]
Under certain regularity conditions the Gibbs sampler will converge to the stationary target distribution. These conditions are ergodicity, meaning that the sampler has the potential to fully explore the domain of the posterior, and irreducibility, meaning that there is a non-zero probability that the chain can transition from one location to any other in the domain of the target density.
Linear Regression with Gibbs
Linear regression is one of the first and most commonly used models in statistical inference and data analysis. Illustrating it in a Bayesian context is useful both because of the ubiquity of the model and because it serves as a building block for learning how to implement the more sophisticated hierarchical and generalised linear mixed-effects models that are common in many applications.
Bayesian Linear Regression
Consider the basic linear regression model: \[ y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+\epsilon_i. \] We can write that the individual \(y_i\) follow a Gaussian distribution \[ y_i\sim N(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip},\sigma^2) \] or we can write the joint probability distribution for the sample \(\mathbf{y}=y_1,\ldots,y_n\) as the multivariate normal distribution \[ \mathbf{y}\sim MVN(\mathbf{X\beta},\mathbf{\Sigma}). \] This is the likelihood for the joint sample. If we assume that the samples are independent, then \(\mathbf{\Sigma}=\sigma^2\mathbf{I}\).
The density function for the multivariate normal (or the likelihood) is \[ f(\mathbf{y}|\mathbf{\beta},\mathbf{\Sigma}) = (2\pi)^{-n/2}\det\left(\mathbf{\Sigma}\right)^{-1/2}\exp\left(-\frac{1}{2}(\mathbf{y}-\mathbf{X\beta})^T\Sigma^{-1}(\mathbf{y}-\mathbf{X\beta})\right). \] Given some prior for \(\mathbf{\beta},\sigma^2\) the posterior can be written as \[ \pi(\mathbf{\beta},\sigma^2|\mathbf{y})\propto f(\mathbf{y}|\mathbf{\beta},\sigma^2)\pi(\mathbf{\beta},\sigma^2). \] We have previously seen that we can derive a Gibbs sampling scheme based on the full conditionals \[ \begin{aligned} \pi(\beta|\sigma^2,\mathbf{y})&\propto f(\mathbf{y}|\mathbf{\beta},\sigma^2)\pi(\mathbf{\beta})\\ \pi(\sigma^2|\mathbf{\beta},\mathbf{y})&\propto f(\mathbf{y}|\mathbf{\beta},\sigma^2)\pi(\sigma^2). \end{aligned} \]
For the conjugate priors \[ \begin{aligned} \pi(\mathbf{\beta})&\propto\exp\left(-\frac{1}{2}\mathbf{\beta}^T\mathbf{D}\mathbf{\beta}\right)\\ \pi(\sigma^2)&\propto\left(\frac{1}{\sigma^2}\right)^{a+1}\exp\left(-\frac{b}{\sigma^2}\right) \end{aligned} \] the full conditionals are
\[ \begin{align} \pi(\boldsymbol{\beta}|\mathbf{y},\sigma^2)&\propto f(\mathbf{y}|\boldsymbol{\beta},\sigma^2)\pi(\boldsymbol{\beta})\\ &\propto (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X\beta})^T(\mathbf{y}-\mathbf{X\beta})\right)\exp\left(-\frac{1}{2}\mathbf{\beta}^T\mathbf{D}\mathbf{\beta}\right)\\ &\propto \exp\left(-\frac12\left\{\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^{T}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})}{\sigma^2}+\boldsymbol{\beta}^T\mathbf{D}\boldsymbol{\beta}\right\}\right)\\ &\propto\exp\left(-\frac12\left\{\boldsymbol{\beta}^T\left(\frac{\mathbf{X}^T\mathbf{X}}{\sigma^2}+\mathbf{D}\right)\boldsymbol{\beta}-2\frac{\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}}{\sigma^2}+\frac{\mathbf{y}^T\mathbf{y}}{\sigma^2}\right\}\right)\\ \pi(\sigma^2|\mathbf{y},\boldsymbol{\beta})&\propto f(\mathbf{y}|\boldsymbol{\beta},\sigma^2)\pi(\sigma^2)\\ &\propto(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X\beta})^T(\mathbf{y}-\mathbf{X\beta})\right)\left(\frac{1}{\sigma^2}\right)^{a+1}\exp\left(-\frac{b}{\sigma^2}\right)\\&\propto \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+a+1}\exp\left(-\frac{1}{\sigma^2}\left\{\frac{(\mathbf{y}-\mathbf{X\beta})^T(\mathbf{y}-\mathbf{X\beta})}{2}+b\right\}\right) \end{align} \] resulting in the set of full conditionals \[ \begin{align} \boldsymbol{\beta}|\mathbf{y},\sigma^2&\sim MVN\left\{\left(\frac{\mathbf{X}^T\mathbf{X}}{\sigma^2}+\mathbf{D}\right)^{-1}\frac{\mathbf{X}^T\mathbf{y}}{\sigma^2},\left(\frac{\mathbf{X}^T\mathbf{X}}{\sigma^2}+\mathbf{D}\right)^{-1}\right\}\\ \sigma^2|\mathbf{y}, \boldsymbol{\beta}&\sim InvGa\left(\frac{n}{2}+a,\frac{(\mathbf{y}-\mathbf{X\beta})^T(\mathbf{y}-\mathbf{X\beta})}{2}+b\right) \end{align} \]
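A minimal R sketch of this Gibbs sampler is given below; the simulated data, prior precision matrix \(\mathbf{D}\), and hyperparameters \(a, b\) are chosen purely for illustration, and `MASS::mvrnorm` is used to draw from the multivariate normal full conditional.
# Gibbs sampler sketch for Bayesian linear regression with the conjugate priors above
# (data, D, a and b are illustrative assumptions, not from the notes)
library(MASS)                        # for mvrnorm()
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))              # design matrix: intercept + one covariate
beta_true <- c(1, 2)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.5))
D <- diag(0.01, ncol(X))             # prior precision for beta
a <- 2; b <- 1                       # InvGamma hyperparameters
N <- 5000
beta_samps <- matrix(NA_real_, nrow = N, ncol = ncol(X))
sigma2_samps <- numeric(N)
beta_samps[1, ] <- 0
sigma2_samps[1] <- 1
for(i in 2:N){
  # beta | y, sigma^2 ~ MVN((X'X/sigma^2 + D)^{-1} X'y / sigma^2, (X'X/sigma^2 + D)^{-1})
  V <- solve(crossprod(X) / sigma2_samps[i-1] + D)
  m <- V %*% crossprod(X, y) / sigma2_samps[i-1]
  beta_samps[i, ] <- mvrnorm(1, mu = m, Sigma = V)
  # sigma^2 | y, beta ~ InvGamma(n/2 + a, (y - X beta)'(y - X beta)/2 + b)
  resid <- y - X %*% beta_samps[i, ]
  sigma2_samps[i] <- 1 / rgamma(1, shape = n/2 + a, rate = sum(resid^2)/2 + b)
}
colMeans(beta_samps)                 # should be close to beta_true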
Metropolis within Gibbs
In the previous example we showed how to construct the Gibbs sampler for a simple case of linear regression using conjugate priors. While the use of conjugate priors is possible in many cases, in others it may not be desirable, and for generalised linear models conjugate priors are not available. In these cases the updating step of the Gibbs sampler will need to use some method like the Metropolis-Hastings algorithm; this is referred to as Metropolis within Gibbs. Alternatively, some other method like slice sampling could be used.
Generalised Linear Models
Generalised linear models are an extension of linear regression where, for some random variable \(Y\sim f(y)\), we describe some function of the expected value as a linear combination of covariates, i.e.
\[
h\left\{\operatorname{E}(y_i)\right\}=\beta_0+\beta_1x_i.
\]
The function \(h\) is called a link function and is derived from the exponential family representation of the probability mass or density function. Fitting generalised linear models, using either classical or Bayesian methods, is difficult. For classical methods there is no closed form for the maximum likelihood estimators, so these have to be found numerically. For Bayesian methods it means that there is no conjugate prior for the regression parameters, and thus slice or Metropolis steps are needed in a Gibbs scheme to sample from the joint posterior.
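As a concrete sketch of Metropolis within Gibbs for a GLM, the code below uses a component-wise random-walk Metropolis update for a Poisson regression with a log link, assuming independent \(N(0,10^2)\) priors on the coefficients; the data, priors, and proposal scale are illustrative assumptions.
# component-wise random-walk Metropolis for a Poisson GLM with log link
# (data, priors and proposal scale are illustrative assumptions)
set.seed(1)
n <- 100
x <- rnorm(n)
beta_true <- c(0.5, 1)
y <- rpois(n, lambda = exp(beta_true[1] + beta_true[2] * x))
log_post <- function(beta){
  eta <- beta[1] + beta[2] * x
  sum(dpois(y, lambda = exp(eta), log = TRUE)) +
    sum(dnorm(beta, mean = 0, sd = 10, log = TRUE))
}
N <- 5000
beta <- matrix(NA_real_, nrow = N, ncol = 2)
beta[1, ] <- c(0, 0)
for(i in 2:N){
  beta[i, ] <- beta[i-1, ]
  for(j in 1:2){                     # update each coefficient in turn (Gibbs-style sweep)
    proposal <- beta[i, ]
    proposal[j] <- rnorm(1, mean = beta[i, j], sd = 0.1)
    # symmetric proposal, so the acceptance ratio only involves the target
    if(log(runif(1)) < log_post(proposal) - log_post(beta[i, ])){
      beta[i, ] <- proposal
    }
  }
}
colMeans(beta)                       # should be near beta_true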
Generalised Linear Mixed Models
Generalised linear mixed models (GLMMs) introduce an added level of hierarchy to a model by adding error to the mean of the likelihood, i.e. given a set of data from some non-Gaussian likelihood we now assume that \[ E(Y|\theta)=h(\theta)=\eta \] where \[ \eta_i\sim N(\beta_0+\beta_1x_i,\delta) \] the resulting hierarchical model is \[\begin{eqnarray} f(\mathbf{y}|\boldsymbol{\eta})\\ \pi(\boldsymbol{\eta}|\beta_0,\beta_1,\delta)\\ \pi(\beta_0,\beta_1,\delta) \end{eqnarray}\]
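To make the hierarchy concrete, the following short generative sketch assumes (purely for illustration) a Poisson likelihood with a log link, so that \(\eta_i\) is the log of the Poisson mean; fitting such a model would typically need a Metropolis or slice step for each \(\eta_i\) inside the Gibbs scheme.
# simulate from the GLMM hierarchy, assuming a Poisson likelihood with log link
# (the likelihood, link, and parameter values are illustrative assumptions)
set.seed(1)
n <- 100
x <- rnorm(n)
beta0 <- 0.5; beta1 <- 1; delta <- 0.2
eta <- rnorm(n, mean = beta0 + beta1 * x, sd = sqrt(delta))   # pi(eta | beta0, beta1, delta)
y <- rpois(n, lambda = exp(eta))                              # f(y | eta)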
Theory questions
Question 1
What are the full conditional densities for the following posterior distribution? \[ \pi(\mu,\tau,\alpha~\vert~\boldsymbol{y}) \propto \tau^{-n/2}\exp\left\{ - \frac{\sum_{i=1}^{n}( \mu - y_{i})^2}{\tau} \right\}\exp\{-\alpha(\tau+1)\}\alpha \] Note that \(\mu \in \mathbb{R}\), \(\alpha > 0\), and \(\tau > 0\).
Can you write the model that this posterior could have been generated from?
Hint: Find the data generating process, or distribution for each \(y_{i}\), and priors for the parameters.
Question 2
Consider the standard linear regression model with three parameters \(\alpha,\beta,\sigma^2\). Assume the linear model has independent normal errors.
\[y_j |\alpha, \beta,\sigma^2 \sim {\rm N} ( \alpha + \beta (x_j-\bar{x}), \sigma^2), \quad \text{for} \quad j=1,\ldots,n\] Assume uniform (improper) priors for \(\alpha,\beta, \log(\sigma^2)\).
Find the three posterior full conditionals \(p(\alpha|\beta,\sigma^2,\mathbf{y}), p(\beta|\alpha,\sigma^2,\mathbf{y}), p(\sigma^2|\alpha,\beta,\mathbf{y})\) to show that
\[(\alpha~\vert~\cdot~) \sim {\rm N} \left(\bar{y}, \frac{\sigma^2}{n} \right),\] \[( \beta~\vert~\cdot~) \sim {\rm N} \left( \frac{\sum y_j(x_j-\bar{x})}{\sum (x_j-\bar{x})^2}, \frac{\sigma^2}{\sum (x_j-\bar{x})^2} \right)\] and for \((\sigma^2 ~\vert~ \alpha,\beta)\) \[\frac{\sum_j (y_j-\alpha-\beta (x_j-\bar{x}))^2}{\sigma^2} \sim \chi^2_n\]
Practical questions
Question 3
Implement a Gibbs sampler for the model in question 1 (theory) using
R.
- Write a one-dimensional MH sampler for the conditional distribution \((\tau~\vert~\mu,\alpha, \boldsymbol{y})\). Check that it works.
- Write a Gibbs sampler for \((\mu,\alpha~\vert~\tau, \boldsymbol{y})\). Check that it works as expected.
- Write a MH-within-Gibbs sampler for the full distribution using the solutions to (a) and (b). The Gibbs step for \(\tau\) should simply use the MH step in (a).
# (a)
# (b)
library(extraDistr)
# the log density of tau given alpha and mu, up to an additive constant
log_dens_tau_given_all <- function(tau, alpha, mu, y, n){
  sum( dnorm(x = y, mean = mu, sd = sqrt(tau/2), log = TRUE) ) - alpha * tau
}
# use a truncated normal proposal (tau has to be positive)
one_step_mh_tau <- function(current_tau, scale, alpha, mu, y, n){
  proposal_tau <- rtnorm(n = 1, mean = current_tau, sd = scale, a = 0)
  # proposal log densities; the truncation makes the proposal asymmetric
  q_log_dens_proposal_given_current <-
    dtnorm(x = proposal_tau, mean = current_tau, sd = scale, a = 0, log = TRUE)
  q_log_dens_current_given_proposal <-
    dtnorm(x = current_tau, mean = proposal_tau, sd = scale, a = 0, log = TRUE)
  # (unnormalised) posterior log densities
  pi_log_dens_tau_proposal <-
    log_dens_tau_given_all(tau = proposal_tau, alpha = alpha, mu = mu, y = y, n = n)
  pi_log_dens_tau_current <-
    log_dens_tau_given_all(tau = current_tau, alpha = alpha, mu = mu, y = y, n = n)
  log_mh_ratio <-
    pi_log_dens_tau_proposal - q_log_dens_proposal_given_current -
    (pi_log_dens_tau_current - q_log_dens_current_given_proposal)
  accept_ratio <- min(1, exp(log_mh_ratio))
  if(runif(1) < accept_ratio){
    return(proposal_tau)
  } else {
    return(current_tau)
  }
}
# check that it works:
set.seed(1)
y_check <- rnorm(n = 10, mean = 1, sd = 1)   # generate the data once, outside the loop
tau_chain <- rep(NA_real_, times = 10000)
tau_chain[1] <- 0.01
for(i in 2:length(tau_chain)){
  tau_chain[i] <-
    one_step_mh_tau(current_tau = tau_chain[i-1],
                    scale = 1.5,
                    alpha = 1,
                    mu = 1,
                    y = y_check, n = 10)
}
# acceptance proportion
mean( diff(tau_chain) != 0 )
plot(tau_chain, type = "l")
# simulate some data
y <- rnorm(n = 100, mean = 3, sd = 1)
y_bar <- mean(y)
n <- length(y)
samples <- matrix(NA_real_, nrow = 10000, ncol = 3)
colnames(samples) <- c("alpha","mu","tau")
samples[1,] <- c(1, 2, 1)
# algorithm tuning:
MH_scale <- 1
for(i in 2:nrow(samples)){
  # update alpha: the posterior kernel in alpha is alpha * exp(-alpha(tau + 1)),
  # i.e. alpha | tau, y ~ Gamma(shape = 2, rate = tau + 1)
  samples[i,"alpha"] <- rgamma(n = 1, shape = 2, rate = samples[i-1,"tau"] + 1)
  # update mu: mu | tau, y ~ N(y_bar, tau / (2n))
  samples[i,"mu"] <- rnorm(n = 1, mean = y_bar, sd = sqrt(samples[i-1,"tau"] / (2 * n)))
  # update tau using the MH step from (a)
  samples[i,"tau"] <- one_step_mh_tau(current_tau = samples[i-1,"tau"],
                                      scale = MH_scale,
                                      alpha = samples[i,"alpha"],
                                      mu = samples[i,"mu"],
                                      y = y,
                                      n = n)
}
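A quick check of the output (not part of the original solution) could look at the acceptance proportion of the \(\tau\) step, the marginal posterior means, and a trace plot:
# quick diagnostics (illustrative)
mean( diff(samples[,"tau"]) != 0 )     # acceptance proportion for the MH step
colMeans( samples[-(1:1000), ] )       # posterior means after discarding burn-in
plot(samples[,"tau"], type = "l")      # trace plot for tau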
Question 4
Given \[
y=-3.48, -2.82, -1.96, 0.48,
-2.59, 2.48, 3.06, 4.05, 3.80, 6.12, 5.98
\] and \[
x = -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5
\] Implement a Gibbs sampler for the model in question 2 (theory)
using R.
- Write a one-dimensional MH sampler for the conditional distribution \((\sigma^2~\vert~\beta,\alpha, \boldsymbol{y})\), assuming the prior \[ \pi(\sigma^2)=\frac{1}{(1+\sigma^2)^2}. \] Check that it works.
- Write a Gibbs sampler for \((\beta,\alpha~\vert~\sigma^2, \boldsymbol{y})\). Check that it works as expected.
- Write a MH-within-Gibbs sampler for the full distribution using the solutions to (a) and (b). The Gibbs step for \(\sigma^2\) should simply use the MH step in (a).
Finally, comment on the performance of these algorithms.
# (a) here
# (b) here
# (c) here
library(magrittr)   # for the %>% pipe
x <- c(-5,-4,-3,-2,-1,0,1,2,3,4,5)
y <- c(-3.48, -2.82, -1.96, 0.48, -2.59, 2.48, 3.06, 4.05, 3.80, 6.12, 5.98)
xbar <- mean(x)
log_like <- function(alpha, beta, sigma2){
  dnorm(y, mean = alpha + beta*(x - xbar), sd = sqrt(sigma2), log = TRUE) %>% sum()
}
## Fix beta = alpha = 1
beta <- 1
alpha <- 1
n <- 1000
s2 <- numeric(length = n)
s2[1] <- 1
for(i in 2:n){
  # exponential proposal with mean equal to the current value
  sc <- rexp(1, rate = 1/s2[i-1])
  # log target at the proposal plus log proposal density of the reverse move
  num <- log_like(alpha, beta, sc) - 2*log(sc + 1) - log(sc) - s2[i-1]/sc
  # log target at the current value plus log proposal density of the forward move
  den <- log_like(alpha, beta, s2[i-1]) - 2*log(s2[i-1] + 1) - log(s2[i-1]) - sc/s2[i-1]
  a <- (num - den) %>% exp()
  u <- runif(1)
  if(u < a){
    s2[i] <- sc
  } else {
    s2[i] <- s2[i-1]
  }
}
library(magrittr)   # for the %>% pipe
x <- c(-5,-4,-3,-2,-1,0,1,2,3,4,5)
y <- c(-3.48, -2.82, -1.96, 0.48, -2.59, 2.48, 3.06, 4.05, 3.80, 6.12, 5.98)
xbar <- mean(x)
log_like <- function(alpha, beta, sigma2){
  dnorm(y, mean = alpha + beta*(x - xbar), sd = sqrt(sigma2), log = TRUE) %>% sum()
}
n <- 1000
## Fix s2 <- 1
s2 <- 1
beta <- numeric(length = n)
alpha <- numeric(length = n)
beta[1] <- 0
alpha[1] <- 0
for(i in 2:n){
  # joint random-walk proposal for (alpha, beta); symmetric, so no proposal correction
  bc <- rnorm(1, mean = beta[i-1], sd = 0.3)
  ac <- rnorm(1, mean = alpha[i-1], sd = 0.3)
  num <- log_like(ac, bc, s2)
  den <- log_like(alpha[i-1], beta[i-1], s2)
  a <- (num - den) %>% exp()
  u <- runif(1)
  if(u < a){
    alpha[i] <- ac
    beta[i] <- bc
  } else {
    alpha[i] <- alpha[i-1]
    beta[i] <- beta[i-1]
  }
}
library(magrittr)   # for the %>% pipe
x <- c(-5,-4,-3,-2,-1,0,1,2,3,4,5)
y <- c(-3.48, -2.82, -1.96, 0.48, -2.59, 2.48, 3.06, 4.05, 3.80, 6.12, 5.98)
xbar <- mean(x)
log_like <- function(alpha, beta, sigma2){
  dnorm(y, mean = alpha + beta*(x - xbar), sd = sqrt(sigma2), log = TRUE) %>% sum()
}
n <- 1000
beta <- numeric(length = n)
alpha <- numeric(length = n)
s2 <- numeric(length = n)
beta[1] <- 0
alpha[1] <- 0
s2[1] <- 1
for(i in 2:n){
  # MH update for (alpha, beta) given sigma^2, as in (b)
  bc <- rnorm(1, mean = beta[i-1], sd = 0.05)
  ac <- rnorm(1, mean = alpha[i-1], sd = 0.05)
  num <- log_like(ac, bc, s2[i-1])
  den <- log_like(alpha[i-1], beta[i-1], s2[i-1])
  a <- (num - den) %>% exp()
  u <- runif(1)
  if(u < a){
    alpha[i] <- ac
    beta[i] <- bc
  } else {
    alpha[i] <- alpha[i-1]
    beta[i] <- beta[i-1]
  }
  # MH update for sigma^2 given (alpha, beta), as in (a)
  sc <- rexp(1, rate = 1/s2[i-1])
  num <- log_like(alpha[i], beta[i], sc) - 2*log(sc + 1) - log(sc) - s2[i-1]/sc
  den <- log_like(alpha[i], beta[i], s2[i-1]) - 2*log(s2[i-1] + 1) - log(s2[i-1]) - sc/s2[i-1]
  a <- (num - den) %>% exp()
  u <- runif(1)
  if(u < a){
    s2[i] <- sc
  } else {
    s2[i] <- s2[i-1]
  }
}
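One simple way to comment on the performance of these samplers (not part of the original solution) is to report acceptance rates and inspect the trace plots; very low acceptance or slowly wandering traces suggest the proposal scales need tuning.
# quick performance diagnostics (illustrative)
mean( diff(alpha) != 0 )               # acceptance rate for the (alpha, beta) step
mean( diff(s2) != 0 )                  # acceptance rate for the sigma^2 step
par(mfrow = c(3, 1))
plot(alpha, type = "l"); plot(beta, type = "l"); plot(s2, type = "l")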