“Most of us have grown so blase about computer developments and capabilities — even some that are spectacular — that it is difficult to believe or imagine there was a time when we suffered the noisy, painstakingly slow, electromechanical devices that chomped away on punched cards.” — Nick Metropolis

Week 9: Bayesian Inference pt 2

Sampling Based Methods for Bayesian Results

Monte Carlo Integration

The basic idea of Monte Carlo integration is \[ \int_0^1f(x)dx\approx\frac{1}{N}\sum_{i=1}^Nf(x_i),\:x_i\stackrel{iid}{\sim}U(0,1) \] or that the value of the integral is the expected value of the probability density function with respect to a uniform density over the range of integration

This integral, or expectation, can be approximated by the sample mean of the density function evaluated at random values drawn uniformly over \((0,1)\). The sample mean converges to the true value and is an unbiased estimator of the integrand.

This can be generalised over any interval \[ \int_a^bf(x)dx\approx(b-a)\frac{1}{N}\sum_{i=1}^Nf(x_i),\:x_i\stackrel{iid}{\sim}U(a,b). \]

Rather than sampling from a uniform distribution and evaluating the probability density function, if samples can be drawn from the probability density function, then one can write the integral in terms of the sample expectation of the indicator function \[ \int_a^bf(x)dx\approx(b-a)\frac{1}{N}\sum_{i=1}^Nf(x_i),\:x_i\stackrel{iid}{\sim}U(a,b)\\ \equiv \frac{1}{N}\sum_{i=1}^N\mathbb{I}_{x_i\in(a,b)},\:x_i\stackrel{iid}{\sim}f(x). \] and for any function \(h(x)\) \[ E(h(x))\approx\frac{1}{N}\sum_{i=1}^Nh(x_i),\:x_i\stackrel{iid}{\sim}f(x). \]

The Fundamental Theorem of Simulation

One definition of the Fundamental Theorem of Simulation is that \[ F(x)\sim U(0,1). \] or that if \(x\) is is a realisation of the random vaiable \(X\sim f(x)\) then cdf \(F(x)\) evaluated at \(x\) follows a uniform distribution over the interval \((0,1)\).

Note the identity for any density \(f\): \[ f(x)=\int_0^{f(x)}1du \] this is the equivalent of saying that \(f\) is the marginal density of \(X\) for the joint distribution: \[ (X,U)\sim\{(x,u):0<u<f(x)\}. \]

Rejection Sampling

We can generalise the algorithm developed above and state that given a target distribution for some random variable \(Y\) with density function \(g(y)\), possibly known up to a proportionality constant, (i.e. \(\int_Yg(y)dy\propto 1\)), we can generate samples from \(g(y)\) by drawing samples from some arbitrary but known probability distribution for \(X\) with density function \(f(x)\) and accepting the generate \(x\) as a sample from \(g(y)\) with the probability \(g(x)/Mf(x)\).

In the actual implementation, we initialise the algorithm by identifying some candidate distribution \(f(x)\) that we can readily use to generate random samples and compute \(M\) such that \[ g(y)\leq Mf(y)\: \forall\, y\in Y \] meaning that \[ M = \sup_Y \frac{g(y)}{f(y)}. \] Then the resulting algorithm is

Draw a sample \(x\) from the candidate density \(f(x)\) and a random sample \(u\sim U(0,1)\).
Compute \(u^*= \frac{g(x)}{Mf(x)}\)
If \(u\leq u^*\) accept \(x\) as a random sample from \(g(\cdot)\). Else reject \(x\)
Repeat until the desired number of samples is reached.

Provided that both densities are normalised, on average, it will take \(M\) draws from \(X\) to generate 1 sample from \(Y\).

MCMC and the Metropolis-Hastings Algorithm

For some some target density \(f(x)\) let \(g(x)\propto f(x)\).

Pick and initial value for \(x_0\)
for \(t\) in \(1:T\):
Given an current value of \(x_t\) choose an arbitrary distribution \(q\) that has the same support as \(f\) and is symmetric, i.e. \(q(x|y)=q(y|x)\) and generate a proposed new value \(x'\).
Calculate the acceptance probability: \[\alpha=\min\left(1,\frac{g(x')}{g(x_t)}\right)\]
Accept \(x_{t+1}=x'\) with probability \(\alpha\), else \(x_{t+1}=x_t\).

\[ \alpha=\min\left(1,\frac{g(x')q(x_t|x')}{g(x_t)q(x'|x_t)}\right) \]

It is important to note that working with the function \(g\) instead of the proper density works because the normalising constant will be the same for \(f(x_t)\) and \(f(x')\), thus they would cancel out in the ratio, making the acceptance probability the same if calculated with \(f\) or \(g\). In a Bayesian context, this is useful because if \(g(\theta) = f(\mathbf{y}|\theta)\pi(\theta)\) from Bayes theorem, then it the Metropolis-Hastings algorithm for drawing samples from a posterior distribution is

Pick and initial value for \(\theta_0\)
for \(t\) in \(1:T\):
Given an current value of \(\theta_t\) choose an arbitrary distribution \(q\) that has the same support as \(\pi(\theta|\mathbf{y})\) to generate a proposed new value \(\theta'\).
Calculate the acceptance probability: \[\alpha=\min\left(1,\frac{f(\mathbf{y}|\theta')\pi(\theta')q(\theta_t|\theta')}{f(\mathbf{y}|\theta_t)\pi(\theta_t)q(\theta'|\theta_t)}\right)\]
Accept \(\theta_{t+1}=\theta'\) with probability \(\alpha\), else \(\theta_{t+1}=\theta_t\).

Running these algorithms for time \(T\) results in a sample of size \(T\) of random samples from the target density function.

The Random Walk Metropolis-Hastings Algorithm

Pick and initial value for \(x_1\)
for \(t\) in \(1:T\):
Choose a distribution \(q(x'|x_t)\) such that \(E(x'|x_t)=x_t\) that has the same support as \(f\) and generate a proposed new value \(x'\).
Calculate the acceptance probability: \[\alpha=\min\left(1,\frac{g(x')q(x_t|x')}{g(x_t)q(x'|x_t)}\right)\]
Accept \(x_{t+1}=x'\) with probability \(\alpha\), else \(x_{t+1}=x_t\).

\[ \alpha=\min\left(1,\frac{g(x')q(x_t|x')}{g(x_t)q(x'|x_t)}\right) \]

Theory questions

Question 1

Antibiotics are sometimes taken by people who have influenza symptoms although flu is caused by a virus. Various controlled comparative trials have been undertaken to investigate the efficacy of antibiotics for various diseases. It is expected, for example, that antibiotics would have little effect in reducing ’flu symptoms.

We have collected data on the times from infection until symptoms disappear for a given disease when patients are treated using the standard treatment \(S\) and using a new treatment \(N\). We assume exponentially distributed times Exp\((\theta_1)\) for the standard and Exp\((\theta_2)\) for the new treatment. We assume a priori that \(\theta_1\) and \(\theta_2\) are independent both with Gamma\((c,c)\) distributions, for some constant \(c\). Suppose there are \(n_{1}=10\) patients for the standard treatment with sample mean \(\bar{y}_1=6.5\) days, whilst there are \(n_{2}=15\) patients for the new treatment with sample mean \(\bar{y}_2=5.0\) days. Let \(c = 10^{-3}\).

Show that the likelihood
\[ p(\mathbf{y}_{1},\mathbf{y}_{2}~\vert~\theta_{1},\theta_{2}) \propto \theta_1^{n_{1}} \exp\{-\theta_1 n_{1} \bar{y}_1\} \times \theta_2^{n_{2}} \exp\{-\theta_2 n_{2} \bar{y}_2\} \] and that the posterior for \(\theta_1\) and \(\theta_2\) is given by \[ p(\theta_1,\theta_2,|\mathbf{y}) \propto \theta_1^{n_{1}+c-1} \exp\{-\theta_1 (n_{1} \bar{y}_1+c)\} \times \theta_2^{n_{2}+c-1} \exp\{-\theta_2 (n_{2} \bar{y}_2+c)\}, \] i.e. the posterior distributions for \(\theta_1\) and \(\theta_2\) are given by independent Gamma\((n_{1}+c,n_{1}\bar{y}_1+c)\) and Gamma\((n_{2}+c,n_{2}\bar{y}_2+c)\) random variables, respectively.

Question 2

When we use Monte Carlo simulation methods to estimate a quantity, usually a mean in statistics, then we should take into account the accuracy of the estimate by finding a standard error of the estimate. The standard error can be reduced by taking more Monte Carlo simulations. The accuracy of the estimate can be defined in terms of a fraction or percentage of the mean, \(\frac{\text{SE}(\hat{\mu})}{\hat{\mu}} = 1\%\). This question is about using sufficient Monte Carlo samples to obtain a given accuracy.

Let \(\theta_1\) be the probability that treatment \(ABC\) for an injured knee is successful. Suppose I assess my uncertainty about \(\theta_1\) as represented by a Beta \((20,30)\) distribution. That is, \(\theta_1\) behaves like a random variable.

By simulating \(1000\) values from the Beta \((20,30)\) distribution, estimate the means of \(\theta_1\) and \(\tau = \theta_1/(1-\theta_1)\), the odds of success. How accurate are these estimates?¹
If the mean of \(\theta_1\) is to be estimated with \(1\%\) accuracy (or standard error of estimate \(= 0.01 \times\) estimated value) or better, are \(1000\) values sufficient in number? How many values should be used in this simulation? Now repeat the same calculation for the mean of the odds \(\theta_1/(1-\theta_1)\). Compare the sample sizes required.
Let \(\theta_2\) be the probability that treatment \(XYZ\) is successful. My uncertainty about \(\theta_2\) is represented by a Beta\((30,20)\) distribution, independent of \(\theta_1\). By simulating \(1000\) pairs of values of \((\theta_1,\theta_2)\), estimate the means of \(\theta_2-\theta_1\), \(\theta_2/\theta_1\), \(\frac{\theta_2/(1-\theta_2)}{\theta_1/(1-\theta_1)}\) (the odds ratio success for treatment \(XYZ\) to treatment \(ABC\)), estimate the lower and upper \(25\%\) percentiles of each of these distributions. What is my belief that treatment \(XYZ\) is better than treatment \(ABC\)?
Repeat parts (a)-(c) with \(\theta_1 \sim \text{Beta}(2,3)\) and \(\theta_2\sim \text{Beta}(3,2)\). Do these distributions represent more or less certainty than above? Explain why the final conclusion about treatment \(XYZ\) being better than treatment \(ABC\) is different.

# (a)

# (b)

# (c)

# (d)


N <- 1000
a <- 20
b <- 30

var_beta <- a * b / ( ((a + b)^2) * (a + b + 1) )
var_beta_prime <- a * (a + b - 1) / ( (b - 2) * (b - 1)^2 )
  
# standard error of \\hat{\\theta}:
se_theta_hat <- sqrt(var_beta / N)

# standard error of \\hat{\\tau}:
se_tau_hat <- sqrt(var_beta_prime / N)

# calculate means:
x <- rbeta(n = N, shape1 = a, shape2 = b)

theta1_hat <- mean( x )
approx_se_theta1_hat <- sqrt( var(x)/N )
  
tau_hat <- mean( x / (1 - x) )
approx_se_tau_hat <- sqrt( var(x / (1 - x))/N )

cat("Theta1 estimate:\\n")
cat(theta1_hat,", true SE =", se_theta_hat, ", approximate SE = ", approx_se_theta1_hat)

cat("Tau estimate:\\n")
cat(tau_hat, ", true SE =", se_tau_hat, ", approximate SE = ", approx_se_tau_hat)


# use true or approximate SE...
se_theta_hat / theta1_hat

se_tau_hat / tau_hat

# both less than 1% currently. 
# Change N to determine required samples for desired accuracy.


theta1 <- rbeta(n = N, shape1 = a, shape2 = b)
theta2 <- rbeta(n = N, shape1 = b, shape2 = a)

theta2_minus_theta1 <- theta2 - theta1
theta2_div_theta1 <- theta2 / theta1
odds_ratio <- (theta2 / (1 - theta2)) / (theta1 / (1 - theta1))

mean(theta2_minus_theta1)
mean(theta2_div_theta1)
mean(odds_ratio)

quantile(theta2_minus_theta1, probs = c(0.25,0.75))
quantile(theta2_div_theta1, probs = c(0.25,0.75))
quantile(odds_ratio, probs = c(0.25,0.75))


theta1 <- rbeta(N, 2, 3, ncp = 0)
theta2 <- rbeta(N, 3, 2, ncp = 0)

theta_est <- mean(theta1)       # theoretical value = 0.4
theta_se <- sqrt(var(theta1)/N)
theta_se/theta_est
if(theta_se/theta_est > 0.01){
  n <- N
  while(theta_se/theta_est > 0.01){
    n <- n + 1
    theta_new <- rbeta(n, 2, 3, ncp = 0)
    theta_est <- mean(theta_new) 
    theta_se <- sqrt(var(theta_new)/n)
  }
}
n


odds_est <- mean(theta1/(1-theta1)) 
odds_se <- sqrt(var((theta1/(1-theta1)))/N)
odds_se/odds_est
if(odds_se/odds_est > 0.01){
  n <- N
  while(odds_se/odds_est > 0.01){
    n <- n + 10
    theta_new <- rbeta(n, 2, 3, ncp = 0)
    odds_est <- mean(theta_new/(1-theta_new))   
    odds_se <- sqrt(var((theta_new/(1-theta_new)))/n)
  }
}
n


theta2_minus_theta1 <- theta2 - theta1
mean(theta2_minus_theta1)
quantile(theta2_minus_theta1, probs = c(0.25,0.75))

theta2_div_theta1 <- theta2/theta1
mean(theta2_div_theta1)
quantile(theta2_div_theta1, probs = c(0.25,0.75))

theta_odds <- (theta2 / (1 - theta2)) / (theta1 / (1 - theta1))
mean(theta_odds)
quantile(theta_odds, probs = c(0.25,0.75))

Hint: Find the standard errors of the estimates. Note that \(\tau\) is distributed according to a Beta prime distribution.↩︎

Practical questions

Question 3

This question continues from Question 1 (theory).

By simulating pairs of values of \(\theta_1,\theta_2\) from the posterior \(p(\theta_1,\theta_2|\mathbf{y})\) estimate the posterior distribution of \(r = \theta_1^{-1}-\theta_2^{-1}\), the mean reduction in infection time until symptom free, comparing the new treatment with the standard.
What is the approximate probability that \(p(r \leq 1 ~\vert~ \boldsymbol{y})\)?
Write a random-walk Metropolis-Hastings (RWMH) algorithm to simulate from \(p(\theta_1,\theta_2|\mathbf{y})\) using
1. Truncated normal proposal (truncated at zero to avoid having negative proposals).
2. Pareto proposal distribution, with support on \([0,\infty)\).
Approximate \(E(r)\) using the samples from the two RWMH for a fixed number of samples. Compare to i.i.d Gamma samples from the posterior. Approximate how large \(N\) needs to be before you the three estimates are indistinguishable (use graphical methods).
Evaluate the performance of the RWMH algorithms using trace, running mean, and (partial) auto-correlation plots, as well as effective sample size approximations.² Based on these metrics, should you run your sampler for a longer than \(N=3000\) iterations?
What proportion of iterations were accepted (by the MH ratio)? Is this proportion close to optimal?


# (a) and (b)


n <- c(10, 15)
y_bar <- c(6.5,5.0)

sim_gamma_posterior <- function(n_sim, n_sample, y_bar){
  
  cbind(
    rgamma(n = n_sim, shape = n[1] + 10e-03, rate = n[1]*y_bar[1] + 10e-03),
    rgamma(n = n_sim, shape = n[2] + 10e-03, rate = n[2]*y_bar[2] + 10e-03)
  )
  
}

exact_samples <- sim_gamma_posterior(n_sim = 5000, n_sample = n, y_bar = y_bar)

# (a) approximate distribution of r = \\theta_1^{-1}-\\theta_2^{-1}:

r_sim_exact <- exact_samples[,1]^(-1) - exact_samples[,2]^(-1)

hist(r_sim_exact)

# (b) P(r <= 1 | y)
mean(r_sim_exact <= 1)


# (c)


# (c)(i) This code is for the RWMH algorithm with truncated normal proposal

library(extraDistr)

gamma2_log_posterior_prop <- 
  function(theta1, theta2, 
          n1, n2, y1_mean, y2_mean, 
          alpha_prior = 10e-03, beta_prior = 10e-03){
  
  if(theta1 <= 0 | theta2 <= 0) return(-Inf)
  
  (n1 + alpha_prior - 1) * log( theta1 ) - theta1 * (y1_mean * n1 + beta_prior) +
    (n2 + alpha_prior - 1) * log( theta2 ) - theta2 * (y2_mean * n2 + beta_prior)
  
}

n1 <- 10
y1_mean <- 6.5
n2 <- 15
y2_mean <- 5.0

gamma2_log_posterior_prop(theta1 = 1, theta2 = 1,
                          n1 = n1, n2 = n2, 
                          y1_mean = y1_mean, y2_mean = y2_mean)


rwmh_sampler_gibbs <- function(n_samples, kernel_scale){
  
  # storage and initialise samples
  samples <- matrix(0, nrow = n_samples, ncol = 2)
  samples[1,] <- 1
  
  sum_accept <- 0
  
  for(t in 2:n_samples){
    
    # sample proposal
    proposal <- rtnorm(n = 2, mean = samples[t-1,], sd = kernel_scale, a = 0)
    
    # calculate densities at proposal and current position
    log_proposal_given_current_dens <- 
      sum(dtnorm(proposal, mean = samples[t-1,], sd = kernel_scale, a = 0, log = TRUE))
    
    log_current_given_proposal_dens <- 
      sum(dtnorm(samples[t-1,], mean = proposal, sd = kernel_scale, a = 0, log = TRUE))
    
    log_posterior_proposal_density <- 
      gamma2_log_posterior_prop(theta1 = proposal[1], theta2 = proposal[2],
                            n1 = n1, n2 = n2, 
                            y1_mean = y1_mean, y2_mean = y2_mean)
    
    log_posterior_current_density <- 
      gamma2_log_posterior_prop(theta1 = samples[t-1,1], theta2 = samples[t-1,2],
                            n1 = n1, n2 = n2, 
                            y1_mean = y1_mean, y2_mean = y2_mean)
    
    log_ratio <- log_posterior_proposal_density + log_current_given_proposal_dens - 
      (log_posterior_current_density + log_proposal_given_current_dens)
    
    accept <- exp(log_ratio) > runif(1)
    if(accept){
      samples[t,] <- proposal
    } else {
      samples[t,] <- samples[t-1,]
    }
  
    sum_accept <- sum_accept + accept
    
  }
   
  cat("Proportion accepted:", round(sum_accept/(n_samples-1),2) )
   
  return(samples)
  
}

samples <- rwmh_sampler_gibbs(n_samples=10000, kernel_scale=0.1)

samples_post_burnin <- samples[5001:10000,]


# (d)


# (d)

# running mean of samples:
plot(
  cumsum(samples_post_burnin[,1])/(1:nrow(samples_post_burnin)),
  type = "l", xlab = "Iteration", ylab = "Theta1")

plot(
  cumsum(samples_post_burnin[,2])/(1:nrow(samples_post_burnin)), 
  type = "l", xlab = "Iteration", ylab = "Theta2")

# compared to 
mean(r_sim_exact)


# (e)


# trace plot

plot(samples_post_burnin[,1], type = "l",  
     xlab = "Iteration", ylab = "Theta1")
plot(samples_post_burnin[,2], type = "l",  
     xlab = "Iteration", ylab = "Theta2")

# effective sample size
library(coda)

effectiveSize(samples_post_burnin[,1])
effectiveSize(samples_post_burnin[,2])

# (p)acf

pacf(samples_post_burnin[,1])

#pacf(samples_post_burnin[,2])

Hint: [p]acf() and effectiveSize() in the coda package can assist you. Don’t forget to discard the first part of your sample as burn-in.↩︎