The researcher hoping to break new ground in the theory of experimental design should involve himself in the design of actual experiments. The investigator who hopes to revolutionize decision theory should observe and take part in the making of important decisions.

— George E. P. Box (1976)¹

Week 6: Decision Theory

Decision theory is a broad topic that can encompass ideas from philosophy, psychology, economics, operations research, applied mathematics, and statistics. Broadly speaking, decision theory is the study of how individuals or agents make choices. In this unit, we focus on the statistical aspect of how to use data to make decisions under various circumstances.

Statistical Decision Theory

Statistical decision theory is the process of making and evaluating decisions under uncertainty in terms of their “cost” or the “loss” incurred by making a given decision. Statistical decision theory also attempts to create a unifying framework for theoretical statistics and statistical inference beyond classical statistical methods.

Risk and Minimax Rules

The basic elements of decision theory are:

Observed data \(\mathbf{x}\) with probability distribution \(X\sim f(x;\theta)\).
The action \(a\), chosen from the set of all possible actions, i.e. \(a\in A\).
The state of nature represented by the unknown parameter(s) \(\theta\).
The decision (or choice of \(a\)) defined by the function \(a=d(\mathbf{x})\)
A loss function \(l(\theta,d(\mathbf{x}))\) used to evaluate decisions and choose actions.

For a given situation, the risk function is defined as the expected value of the loss function with respect to the probability distribution of the data, i.e.
\[ R(\theta,d)=E\left(l(\theta,d(\mathbf{x}))\right)=\int_{\mathbf{X}}l(\theta,d(\mathbf{x})f(\mathbf{x}|\theta)d\mathbf{x}. \] This definition is both intuitive and straightforward; however, the risk depends on the state of nature \(\theta\) and the decision function \(d(\cdot)\). Our goal is not only to evaluate a single decision rule but to choose an optimal decision rule and account for the uncertainty associated with the unknown state of nature.

The minimax rule is designed to offer a very conservative means of choosing a decision by stating that the optimal decision under the minimax rule is the one which has the smallest (over the decision space) maximum risk (over the state of nature space). Or in other words, it chooses the option with the best worst outcome.

The Minimax Rule
The minimax rule selects the decision \(d^*\) in the decision space \(D\) that minimises the maximum risk over the possible states of nature \(\theta\in\Theta\) \[ d^*=d:\min_{d\in D}\left\{\max_{\theta\in\Theta}R(\theta,d)\right\}. \] The minimax rule focuses on minimising the worst possible outcome rather than optimising some other criteria. This property makes the minimax rule very conservative in choosing actions.

Science and Statistics, 1976, p. 792↩︎

Choice of Loss Function

The loss function is the connection between decisions and their consequences. The loss function form can determine the computational difficulties that might arise in applying a rule for finding the optimal decision rule.

In cases where the loss incurred is expressed as a utility (e.g. profit), the loss function measures the negative of the utility, and we construct the loss function to reflect the level of risk aversion (or risk-seeking) propensity of the decision-maker.

In optimisation, it is desirable to have a loss function that globally continuous and differentiable. Common choices of loss function in these situations are squared-error loss or \(L_2\) loss \[ l(\theta,d(\mathbf{x}))=(\theta_0-\hat{\theta})^2 \] where the true state of nature is \(\theta_0\) and the decision rule is \(d(\mathbf{x})=\hat{\theta}\). Squared-error loss has the nice property of being continuous and smooth (differentiable), but outliers can overly influence results. Alternatively the absolute value loss function.n or \(L_1\) loss \[ l(\theta,d(\mathbf{x}))=|\theta_0-\hat{\theta}| \] avoids the undue influence of outliers but is not differentiable at \(\hat{\theta}=0\), and is more difficult to optimise than squared-error loss. In some cases, it is desirable to use \(0-1\) or \(L_0\) loss \[ l(\theta,d(\mathbf{x}))=\left\{ \begin{array}{ll} 1,&\hat{\theta}=\theta_0\\ 0,&\hat{\theta}\neq\theta_0 \end{array} \right. \] which requires numerical evaluation.

Paramter Estimation

We now consider a decision-theoretic approach called loss function optimality in which we choose our estimator or decision function based on the minimisation of the loss function.

We have already seen this when we considered the least-squares estimator, which is the \(L_2\) loss function. Given a random sample of size \(n\) \(\mathbf{x}=x_1,x_2,\ldots,x_n\) from the probability distribution of the random variable \(X\) we define our decision rule as the estimator of the expected value of \(X\), under \(L_2\) loss the optimal estimator \(d^*(\mathbf{x})\) (i.e. the one that minimises loss) is \[ \begin{align} d^*(\mathbf{x})&=\underset{\theta\in \Theta}{\operatorname{argmin}}\sum_{i}^n\left(x_i-g(\theta)\right)^2\\ &=\underset{\theta\in \Theta}{\operatorname{argmin}}||\mathbf{x}-g(\theta)||_2 \end{align} \] In the case of \(L_1\) or median loss \[ d^*(\mathbf{x})=\underset{\theta\in \Theta}{\operatorname{argmin}}||\mathbf{x}-\theta||_1. \] the optimal estimator of \(\theta_0\) is the sample median. For \(L_0\) loss or \(0-1\) loss \[ d^*(\mathbf{x})=\underset{\theta\in \Theta}{\operatorname{argmin}}||\mathbf{x}-\theta||_0. \] the optimal estimator of \(\theta_0\) is the sample mode or the most frequently occurring value.

We can think of the maximum likelihood estimator as a loss function optimality problem where the loss function is the negative of the likelihood function \[ l(\theta,d(\mathbf{x}))=-L(\theta|\mathbf{x{}}) \] in this case, the minimiser of the loss function is the maximum likelihood estimator.

The loss functions \(L_2\), \(L_1\), and \(L_0\) are some of the more commonly used objective functions in optimisation, but the set of possible loss functions is very large. Using loss function optimality, we can derive estimators for specific situations where concepts like likelihood or least-squares may not apply directly to the problem.

Theory questions

Question 1

An investor has 1000 dollars to invest in speculative stocks. The investor is considering investing \(d\) dollars (a decision) in stock A and \((1000 - d)\) dollars in stock B. An investment in stock A has a \(0.6\) chance of doubling in value, and a \(0.4\) chance of being lost. An investment in stock B has a \(0.7\) chance of doubling in value, and a \(0.3\) chance of being lost. The investor’s loss function for a change in fortune, \(z\), is \(L(\theta, z) = -\theta\log(0.0007 z + 1)\) for \(-1000 \leq z \leq 1000\) and \(1 \leq \theta \leq 2\). The variable \(\theta\) is just an unknown parameter of the loss function in this example².

What is the set of all possible outcomes, \(\mathcal{Z}\), for a fixed \(d\)? (It consists of four elements.)
What is the expected loss, or risk function?
Determine the minimax decision rule \(d^{\star}_{m}\).
Hint: Plotting the risk function for fixed \(\theta\) in R, and find its root may be helpful.

library(ggplot2)
risk <- function(d){
  - (
    0.12 * log(0.3) +
    0.18 * log(0.0014 * d + 0.3) +
    0.28 * log(1.7 - 0.0014 * d) +
    0.42 * log(1.7)
  )
  
}

ggplot(data.frame(d = c(0,1000)),aes(x = d)) + 
  stat_function(fun = risk) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_vline(xintercept = 776.96, linetype = 2) +
  ylab("r(d)")

Solution:
a. The set of all possible outcomes (change in fortunes): \(\mathcal{Z} = \{-1000, 2d - 1000, 1000 - 2d, 1000\}\) whose elements corresponds to "both fail", "stock A succeeds, stock B fails", "stock A fails, stock B succeeds", "both succeed".

b. The probabilities associated with each outcome in \(\mathcal{Z}\) are \[\{0.4\times 0.3, 0.6\times 0.3, 0.4\times 0.7, 0.6\times 0.7\}=\{0.12, 0.18, 0.28, 0.42\}.\] The values of the loss funtion \(L(\theta, z) = -\log(0.0007\theta z + 1)\) for each outcome is \[-\theta\log(0.3), -\theta\log(0.0014d + 0.3), -\theta\log(1.7 - 0.0014d), -\theta\log(1.7),\] therefore the expected loss (or risk function) is \[ R(\theta, d) = -\theta\left[0.12\log(0.3)+0.18\log(0.0014d+0.3)+0.28\log(1.7-0.0014d)+0.42\log(1.7)\right] \]

c. Note that the risk function can be written as \(R(\theta, d) = \theta r(d)\) where \[ r(d) = -\left[0.12\log(0.3)+0.18\log(0.0014d+0.3)+0.28\log(1.7-0.0014d)+0.42\log(1.7)\right] \] Plotting the function \(r(d)\) we can see that \(r(d) < 0\) for \(d < 776.96\) and \(r(d) \geq 0\) for \(d \geq 776.96\). Therefore, since \(1 < \theta < 2\), when \(r(d) < 0\), the maximum risk is for \(\theta = 1\). And when \(r(d) \geq 0\) the maximum risk is occurs when \(\theta = 2\). In other words the maximum risk is \[ \max_{\theta} R(\theta, d) = \begin{cases} R(1, d) & \text{ if } d \leq 776.96 \\ R(2, d) & \text{ if } d > 776.96 \\ \end{cases} \] Next we'll minimise \(R(\theta, d)\) for a decision \(d\), for a given \(\theta\): \[ \frac{\partial}{\partial~d} R(\theta, d) = -\theta\left[\frac{0.000252}{0.0014d +0.3} -\frac{0.000392}{1.7 - 0.0014d}\right] = 0 \] So we need to solve: \[ \frac{2.52\times 10^{-4}}{1.4\times 10^{-3}d +0.3} = \frac{3.92\times 10^{-4}}{1.7 - 1.4\times 10^{-3}d} \] \[ 4.284 \times 10^{-4} - 3.528 \times 10^{-7}d = 5.488 \times 10^{-7}d + 1.176 \times 10^{-4} \] \[ 9.016 \times 10^{-7}d = 3.108 \times 10^{-4} \] \[ d = 344.72. \] Therefore \[ \underset{d}{\operatorname{argmin}} \underset{\theta}{\operatorname{max}} R(\theta, d) = \underset{d}{\operatorname{argmin}} \left\{ \begin{array}{ll} R(1, d) & \text{ if } d \leq 776.96 \\\ R(2, d) & \text{ if } d > 776.96 \end{array}\right. \] \[ =\underset{d}{\operatorname{argmin}} \left\{ \begin{array}{ll} R(1, 344.72) = -0.089 & \text{ if } d = 344.72 \\\ R(2, 776.96) = 0 & \text{ if } d = 776.96 \end{array}\right. \] so the minimax decision is \(d^{\star}_{m} = 344.72\).

Question adapted from @Berger1985.↩︎

Practical questions

Question 2

Consider a point estimation problem in which you observe \(x_1, \ldots, x_n\) as i.i.d. random variables from the Poisson distribution with density \[ f(x~\vert~ \theta) = \frac{\theta^{x}\exp\{-\theta\}}{x!}. \]

Find the optimal estimator for \(\theta\) under the loss function \[ l(\theta,d(\mathbf{x}))=(d-\theta)^4-(d-\theta)^3+(d-\theta)^2+(d-\theta) \]
Compute and graph the loss functions of \(d^{\star}_{B}(\boldsymbol{x})\) and that of the MLE \(d^{\star}_{m}(\boldsymbol{x}) = \bar{x}\). (Hint: plot the functions in terms of \((d-\theta)\) to see where their respective minima are located.)
Compare the estimates from the quartic loss function to the MLE and least-squares loss functions.

Solution:
a. The posterior distribution for \(\theta\) is proportional to \[ \pi(\theta~\vert~\boldsymbol{x}) \propto \theta^{n\bar{x}+\alpha -1}\exp\{-(n + \beta)\theta\} \] therefore the posterior has distribution \((\theta~\vert~\boldsymbol{x}) \sim \text{Gamma}(n\bar{x}+\alpha, n + \beta)\). The expected loss w.r.t. the posterior is \[ \begin{aligned} E[~L(\theta, d)~\vert~\boldsymbol{x}~] &= \int_{0}^{\infty} L(\theta,d) \pi(\theta~\vert~\boldsymbol{x}) \text{ d}\theta \\ & = \int_{0}^{\infty} (d^2 - 2d\theta + \theta^2) \pi(\theta~\vert~\boldsymbol{x}) \text{ d}\theta \\ & = d^2 \int_{0}^{\infty} \pi(\theta~\vert~\boldsymbol{x}) \text{ d}\theta - 2d \int_{0}^{\infty} \theta \pi(\theta~\vert~\boldsymbol{x}) \text{ d}\theta + \int_{0}^{\infty}\theta^2 \pi(\theta~\vert~\boldsymbol{x}) \text{ d}\theta \\ &= d^2 - 2d~E(\theta~\vert~\boldsymbol{x}) + E(\theta^2~\vert~\boldsymbol{x}) \\ &= d^2 - 2d\frac{n\bar{x}+\alpha}{\beta +n} + \frac{(n\bar{x}+\alpha)^2 + n\bar{x}+\alpha}{(\beta +n)^2} \end{aligned} \] The Bayesian decision rule is the minimum \(d\) for the above function, which occurs when \[ d^{\star}_{b} = \frac{n\bar{x}+\alpha}{\beta +n} \] or \(a = \alpha/(\beta+n)\) and \(b = n/(\beta+n)\). In other words, under quadratic loss the Bayesian decision rule is to choose the posterior mean.

b. The loss function using the Bayes decision rule is \[ \begin{aligned} L(\theta,d^{\star}_{b}) &= \left(\frac{n\bar{x}+\alpha}{\beta +n} - \theta\right)^2 = \left(\frac{Z+\alpha}{\beta +n} - \theta\right)^2 \\ &= \frac{Z^2 + 2\alpha Z + \alpha^2}{(\beta+n)^2} - 2\theta \frac{Z+\alpha}{\beta +n} + \theta^2\\ E[L(\theta,d^{\star}_{b})] &= \frac{ E[Z^2] + 2\alpha E[Z] + \alpha^2}{(\beta+n)^2} - 2\theta \frac{ E[Z]+\alpha}{\beta +n} + \theta^2, \quad E()\text{ w.r.t. to data, Z}\\ &= \frac{n\theta(1+ n\theta) + 2\alpha n\theta + \alpha^2}{(\beta+n)^2} - 2\theta \frac{ n\theta+\alpha}{\beta +n} + \theta^2 \\ &= \frac{n^{2}\theta^{2} + (2\alpha +1) n\theta + \alpha^2}{(\beta+n)^2} - 2\theta \frac{ n\theta+\alpha}{\beta +n} + \theta^2 \\ &= \frac{\theta^{2}}{(\beta+n)^2}n^2 + \left[ \frac{(2\alpha +1) \theta}{(\beta+n)^2} -\frac{2\theta^2}{\beta+n}\right]n + \left(\frac{\alpha^2}{(\beta+n)^2} - \frac{2\theta\alpha}{\beta+n} +\theta^2\right) \end{aligned} \] where \(Z\) is the sum of \(n\) Poisson random variables with mean \(\theta\). Note that \(Z \sim \text{Pois}(n\theta)\), therefore \(E(Z) = \text{Var}(Z) = n\theta\), and \(E(Z^2) = n\theta(1+ n\theta)\). The loss function using the MLE decision rule is \[ \begin{aligned} L(\theta, \bar{x}) &= \left(\bar{x} - \theta\right)^2 = \frac{Z^2}{n^2} - 2\theta\frac{Z}{n} +\theta^2 \\ E[L(\theta, \bar{x})] &= \frac{E[Z^2]}{n^2} - 2\theta \frac{E[Z]}{n} + \theta^2 \\ &= \frac{n\theta(1+ n\theta)}{n^2} - 2\theta \frac{ n\theta}{n} + \theta^2 \\ &= \frac{\theta}{n} \end{aligned} \]

c. The Bayes risk of the Bayes rule is \[ \begin{aligned} r(n)=E[~L(\theta, d^{\star}_{b})~\vert~\boldsymbol{x}~] = &-\frac{(n\bar{x}+\alpha)^2}{(\beta +n)^2} + \frac{(n\bar{x}+\alpha)^2 + n\bar{x}+\alpha}{(\beta +n)^2} \\ &= \frac{n\bar{x}+\alpha}{(\beta +n)^2} \end{aligned} \] Note: Expectation w.r.t. posterior of \(\theta\)._ therefore \(\mathcal{O}[r(n)] = 1/n\), which is decreasing and tends to zero as \(n \rightarrow \infty\)

d. The Bayes risk prior to the experiment is \(r(0) = \alpha/\beta^2\). Find \(n\) such that \[ \begin{aligned} r(0) &= 2 r(n)\\ \frac{\alpha}{\beta^2}&=2\frac{n\bar{x}+\alpha}{(\beta +n)^2} \\ \alpha(\beta^{2} + 2\beta n +n^2) &= 2 (n\bar{x}+\alpha) \beta^2 \\ \alpha\beta^{2} + 2\alpha\beta n +\alpha \beta^2n^2 &= 2\bar{x}\beta^2 n+2\alpha \beta^2 \\ 2\alpha\beta n +\alpha \beta^2n^2 &= 2\bar{x}\beta^2 n+\alpha \beta^2 \\ \alpha \beta^2n^2 + 2(\alpha\beta - \bar{x}\beta^2) n - \alpha \beta^2 &= 0 \\ \end{aligned} \] which can be solved using the quadratic equation formula.