
The Analysis of Variance

When considering hypothesis testing we looked at one-sample and two-sample cases, but in many instances, especially in experimental settings (e.g. lab experiments or clinical trials), we want to compare the effects of factors that have more than one or two possible levels. In these cases we need a more general approach to analysing results and testing hypotheses. The statistical technique used to analyse these data and test for the statistical significance of effects is called the analysis of variance (ANOVA).

Designing Experiments

An experiment is a trial that attempts to isolate the effects of factors of interest on specific outcomes while eliminating as much as possible extraneous effects on outcomes. Experiments are typically designed to focus on a few factors and include some degree of repetition and randomisation to make statistical inferences.

An experimental unit is the object whose outcome or response is measured and is of interest in the process of the experiment. The outcome or response measured is called the dependent variable.

A factor is an independent variable controlled and varied by the experimenter. Factors are measured and varied in terms of levels or discrete states rather than as continuous or discrete numerical measures.

A treatment is a specific combination of factor levels.

A response is the variable measured by the experimenter, typically a continuous numeric response.

Analysis of Variance (ANOVA) Explained

We can explain the concept of the Analysis of Variance (ANOVA) as a generalisation of the two-sample \(t\)-test. Historically, ANOVA predates the maximum likelihood principle and the Neyman-Pearson paradigm for hypothesis testing, and can be derived separately, but for the purposes of inference it is useful to think of it in terms of hypothesis testing.

Paint Coverage: Deriving the Sum of Squares for ANOVA

House paint is sold in containers labelled with the coverage and drying time for a single coat of paint. Suppose three manufacturers each claim that a four-litre can of their paint covers the same area; how can we test whether all three brands have the same coverage?


We can think of this as a simple experiment testing a single factor (brand of paint) with multiple levels (the individual brands). If we are testing a single factor with two levels (Brand A versus Brand B), measuring the response (coverage in \(m^2\)), the results are analysed using a two-sample \(t\)-test. Assuming that \(n_1=n_2=n\), we can write a model for our responses as \[y_{ij} = \mu_i+\epsilon_{ij},\quad i = A,B\mbox{ and } j = 1,\ldots,n\] where \(\mu_i\) is the mean for Brand A or Brand B, and \(j\) indexes the samples for each brand.


Now suppose that we want to investigate three brands of paint (Brand A, Brand B, and Brand C). The two-sample \(t\)-test is no longer a valid approach to analysing the data. Instead, we can extend the statistical model for the experiment (assuming that each sample is of equal size, i.e. \(n_1=n_2=n_3=n\)) \[y_{ij} = \mu_i+\epsilon_{ij},\quad i = A, B, C \mbox{ and } j = 1,\ldots,n\]

Let \(y_{ij}\) be the outcome of the experiment for replication \(j\) of treatment \(i\); then the complete set of experimental outcomes is \[\mathbf{y} = (y_{11},y_{12},\ldots,y_{IJ})\] where there are \(I\) different treatments each replicated \(J\) times, and \(n = IJ\). We can describe the total variation in experimental outcomes with the total sum of squares (SST), defined as \[SST = \sum_{i=1}^I\sum_{j=1}^J(y_{ij}-\bar{y}_{..})^2,\] where \[\bar{y}_{..}=\frac{1}{IJ}\sum_{i=1}^I\sum_{j=1}^Jy_{ij}\] is the grand mean, the overall average response over all treatments and replications. The total sum of squares can be decomposed into two parts. First, the sum of squares of the error (SSE) is the pooled variation within the treatment groups \[\begin{aligned} SSE &= (n_1-1)s_1^2+\cdots+(n_I-1)s_I^2\\ &=\sum_{i=1}^I \sum_{j=1}^J (y_{ij}-\bar{y}_i)^2\end{aligned}\] where \[\bar{y}_{i}=\frac{1}{J}\sum_{j=1}^Jy_{ij}\] is the mean for treatment \(i\). Second, the sum of squares of the treatments (SSTR) is \[SSTR = \sum_{i=1}^I J(\bar{y}_{i}-\bar{y}_{..})^2,\] resulting in \[SST = SSTR + SSE.\]
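This decomposition can be verified numerically. The following is a minimal sketch using made-up data (three treatments with five replicates each; all values are illustrative, not from the paint example):

set.seed(1)
I <- 3; J <- 5                                   # I treatments, J replicates each
y <- matrix(rnorm(I * J, mean = rep(c(10, 12, 11), each = J)),
            nrow = I, byrow = TRUE)              # row i holds the J responses for treatment i
grand_mean <- mean(y)                            # the grand mean
trt_means  <- rowMeans(y)                        # the treatment means
SST  <- sum((y - grand_mean)^2)                  # total variation
SSE  <- sum((y - trt_means)^2)                   # within-treatment variation
SSTR <- J * sum((trt_means - grand_mean)^2)      # between-treatment variation
all.equal(SST, SSTR + SSE)                       # TRUE: the decomposition holds
MSTR <- SSTR / (I - 1)                           # mean squares (see the table below)
MSE  <- SSE / (I * J - I)
F_stat <- MSTR / MSE                             # the F statistic used for testing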

Thus the total sum of squares decomposes into the sum of squares of the treatments plus the sum of squares of the error.

Each source of variation is computed as a sum of squares and divided by its degrees of freedom to estimate its contribution to the total variance. These results are presented in an ANOVA table.

Source      Degrees of Freedom   Sum of Squares   Mean Squares        \(F\)
Treatment   I-1                  SSTR             MSTR = SSTR/(I-1)   MSTR/MSE
Error       n-I                  SSE              MSE = SSE/(n-I)
Total       n-1                  SST

Note that \(n = IJ\) and the mean squares are just the sums of squares divided by their degrees of freedom. The statistic \(F\) is the ratio of MSTR to MSE, i.e. the ratio of the between-treatment variation to the within-treatment variation.

ANOVA Inference

We assume that the observations \(y_{ij}\) follow a Gaussian distribution with a common variance \(\sigma^2\), but not necessarily the same mean. Under these assumptions the statistic \(F\), the ratio of the mean square of the treatments to the mean square of the error, follows an \(F\)-distribution with \(I-1\) and \(n-I\) degrees of freedom.

The \(F\) distribution is a continuous probability distribution that describes the sampling distribution of the ratio of two sample variances. The \(F\) distribution has two parameters, which are the degrees of freedom of the numerator and the degrees of freedom of the denominator. \[ \frac{s_1^2}{s^2_2}\sim F_{\nu_1,\nu_2} \] where if \[ s_1^2=\frac{1}{n_1-1}\sum_{i=1}^{n_1}(x_i-\bar{x})^2\quad\mbox{and} \quad s^2_2=\frac{1}{n_2-1}\sum_{j=1}^{n_2}(y_j-\bar{y})^2 \] then \[ \nu_1 = n_1-1\quad \mbox{and}\quad \nu_2=n_2-1. \]
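As a quick sanity check (a simulation sketch; the sample sizes are arbitrary), we can verify that the ratio of two sample variances drawn from the same normal population behaves like an \(F\) random variable:

set.seed(2)
n1 <- 8; n2 <- 12
# 10,000 ratios of sample variances, both samples from the same N(0,1) population
ratios <- replicate(10000, var(rnorm(n1)) / var(rnorm(n2)))
quantile(ratios, 0.95)                 # simulated upper 5% point
qf(0.95, df1 = n1 - 1, df2 = n2 - 1)   # theoretical F quantile: close agreement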

We can think about this in terms of a hypothesis test; if we want to test whether or not one of the treatments produces a different outcome than the others, we use the hypotheses \[ H_0: \mu_1=\mu_2=\cdots=\mu_I\qquad\mbox{and}\qquad H_A: \mbox{ at least one treatment mean is different}. \] Under the null hypothesis \[ MSTR\approx MSE, \] i.e. the variance between treatments is approximately equal to the variance within treatments, which is what we expect when all treatment means are the same. So we would reject the null hypothesis if MSTR was much larger than MSE, that is, if the treatments accounted for more of the total variance than the error. This translates to rejecting for large values of \(F\), so our rejection rule is \[F>F_{\mbox{critical}}\] and, just as in previous cases, we choose our critical value based on a Type I Error rate so that \[ F_{\mbox{critical}} = F_{\nu_1,\nu_2,\alpha} \] where \[ Pr(F<F_{\mbox{critical}})= 1-\alpha \] with \(\nu_1 = I-1\) and \(\nu_2 = n-I\). Finding the critical value for \(F\) is easily done using R or other statistical software. In practice, most statistical software will produce an ANOVA table with the \(F\) statistics computed and some indication of the results of testing.
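In R the critical value comes from the qf() function. A small sketch with hypothetical numbers (a balanced design with \(I = 4\) treatments and \(J = 6\) replicates, chosen purely for illustration):

I <- 4; J <- 6; n <- I * J                # hypothetical balanced design
alpha <- 0.05
qf(1 - alpha, df1 = I - 1, df2 = n - I)   # the F critical value
# p-value for a made-up observed statistic, say F = 3.2
pf(3.2, df1 = I - 1, df2 = n - I, lower.tail = FALSE)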

Example:

A study in Kirchhoefer (1979) reviewed the results of an experiment evaluating a semi-automated method for measuring chlorpheniramine maleate in tablets. The experimenter allocated ten samples to each of seven labs, and the results were analysed to determine if there were differences between the labs. The results in tabular form are:

Lab 1   Lab 2   Lab 3   Lab 4   Lab 5   Lab 6   Lab 7
4.13    3.86    4.00    3.88    4.02    4.02    4.00
4.07    3.85    4.02    3.88    3.95    3.86    4.02
4.04    4.08    4.01    3.91    4.02    3.96    4.03
4.07    4.11    4.01    3.95    3.89    3.97    4.04
4.05    4.08    4.04    3.92    3.91    4.00    4.10
4.04    4.01    3.99    3.97    4.01    3.82    3.81
4.02    4.02    4.03    3.92    3.89    3.98    3.91
4.06    4.04    3.97    3.90    3.89    3.99    3.96
4.10    3.97    3.98    3.97    3.99    4.02    4.05
4.04    3.95    3.98    3.90    4.00    3.93    4.06


If we want to test the hypotheses \[ H_0: \mbox{There is no difference between labs}\quad\mbox{and}\quad H_A: \mbox{At least one lab is different} \] then we can perform an \(F\)-test. The degrees of freedom for the numerator and denominator can be taken from the ANOVA table as 6 and 63 respectively, and we can use R to determine that \[ F_{\mbox{critical}}=2.25 \] for \(\alpha = 0.05\).

The remaining calculations are tedious by hand, so we carry out the analysis in R.

# Chlorpheniramine maleate measurements (Kirchhoefer, 1979), entered lab by lab
library(tibble)   # tibble()
library(dplyr)    # the %>% pipe
library(pander)   # formatted ANOVA tables

y <- c(4.13,4.07,4.04,4.07,4.05,4.04,4.02,4.06,4.10,4.04,
       3.86,3.85,4.08,4.11,4.08,4.01,4.02,4.04,3.97,3.95,
       4.00,4.02,4.01,4.01,4.04,3.99,4.03,3.97,3.98,3.98,
       3.88,3.88,3.91,3.95,3.92,3.97,3.92,3.90,3.97,3.90,
       4.02,3.95,4.02,3.89,3.91,4.01,3.89,3.89,3.99,4.00,
       4.02,3.86,3.96,3.97,4.00,3.82,3.98,3.99,4.02,3.93,
       4.00,4.02,4.03,4.04,4.10,3.81,3.91,3.96,4.05,4.06)
labs <- c(rep("Lab 1", 10),rep("Lab 2", 10), rep("Lab 3",10), rep("Lab 4",10),
          rep("Lab 5",10), rep("Lab 6",10),rep("Lab 7",10))

df <- tibble(y = y, lab = labs)   # one row per measurement

model <- lm(y~lab, data = df)     # one-way ANOVA model: response y, factor lab

panderOptions('missing',"")
model%>%anova()%>%pander()        # the ANOVA table
Analysis of Variance Table
  Df Sum Sq Mean Sq F value Pr(>F)
lab 6 0.1247 0.02079 5.66 9.453e-05
Residuals 63 0.2314 0.003673



We can see from the results that MSE = 0.003673 and MSTR = 0.02079, giving \(F = 5.66\). Since the critical value of \(F\) is 2.25, we reject the null hypothesis and conclude that the measuring process in at least one lab is different from the others.

This example describes a single-factor experiment where treatments are assigned to individuals completely at random, i.e. a completely randomised design. Rejecting the null hypothesis tells us that at least one of the treatments results in different outcomes compared to the rest, but beyond that it doesn't tell us much; to say more we need some more sophisticated tools.

The Problem of Multiple Comparisons

The \(F\)-test for ANOVA is robust and relatively straightforward; it is based on comparing the variance within groups (or treatments) to the variance between groups (or treatments). If the variance between groups is much greater than the variance within groups, we reject the null hypothesis \(H_0\) that all groups or treatments have the same effect. Rejecting the null hypothesis, however, simply tells us that at least one of the treatments is different, not which one.

Instinctively, we might suggest testing which treatment is different by performing multiple two-sample \(t\)-tests, as we learned previously.

In the example from Kirchhoefer (1979) there are seven labs, so to test all the possible pairwise comparisons we would have to perform \[{7\choose 2} = 21\] two-sample \(t\)-tests. Each of these tests has a Type I Error Rate of \(\alpha\); thus, in performing 21 tests, the actual Type I Error Rate is the probability of making at least one Type I error, or (assuming the tests are independent) \[1-Pr(\mbox{no Type I errors})=1-(1-\alpha)^{21}.\] For \(\alpha = 0.05\) this is \[1-(1-0.05)^{21}=0.659.\] This means that our testing procedure is far less reliable than intended: we have a very high probability of wrongly deciding that two treatment means differ.
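This calculation is a one-liner in R:

alpha <- 0.05
m <- choose(7, 2)        # 21 pairwise comparisons among 7 labs
1 - (1 - alpha)^m        # familywise Type I error rate: approx 0.659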

Testing the Equality of the Treatment Means

There are a number of different approaches to addressing this problem of an inflated Type I error rate when testing the differences between multiple treatment means; one is Tukey’s Honest Significant Difference (HSD), based on the sampling distribution of \[ q_{I,I(J-1)}=\max_{i_1,i_2}\frac{|(\bar{y}_{i_1}-\mu_{i_1})-(\bar{y}_{i_2}-\mu_{i_2})|}{s_p/\sqrt{J}}, \] Tukey’s Studentised Range, where \(i_1\) and \(i_2\) refer to any arbitrary pair of treatments and \(s_p\) is the pooled standard deviation (the square root of the MSE). Using this sampling distribution, we can construct a confidence interval to perform the pairwise test of treatment means \[ H_0: \mu_{i_1}=\mu_{i_2}\qquad\mbox{and}\qquad H_A: \mu_{i_1}\neq\mu_{i_2}. \] For a Type I Error Rate of \(\alpha\) we reject \(H_0\) if \[ |\bar{y}_{i_1}-\bar{y}_{i_2}|>q_{I,I(J-1),\alpha}\frac{s_p}{\sqrt{J}}. \] Finding values of \(q_{I,I(J-1),\alpha}\) requires either statistical software (e.g. R) or a set of tables.

Example:

In the example from Kirchhoefer (1979) the results of the hypothesis test indicate that at least one of the labs produces different measurements than the rest. Use Tukey’s Honest Significant Difference to identify which labs differ from the others.


Continuing with the example of lab measurements, we can list the labs in decreasing order of their mean measurements:

Lab   Mean
1     4.062
3     4.003
7     3.998
2     3.997
5     3.957
6     3.955
4     3.920

For a Type I Error Rate of \(\alpha = 0.05\) we can find \(q_{7,63,\alpha} = 4.31\) in R with qtukey(0.95, 7, 63). \(s_p\) is the square root of the mean squared error (MSE), giving \(s_p = 0.06\). According to Tukey’s method we reject the hypothesis that two means are the same if \[|\bar{y}_{i_1}-\bar{y}_{i_2}|>0.082.\] Based on this, we conclude that the mean of the measurements produced by Lab 1 differs from those of Labs 4, 5, and 6.

Note that if we had performed an unadjusted \(t\)-test at the same Type I Error Rate, the rejection rule would have been \[|\bar{y}_{i_1}-\bar{y}_{i_2}|>0.053\] and we would have concluded that Lab 1 produced measurements that differed from every other lab.
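Both rejection thresholds can be reproduced in R, using MSE = 0.003673 from the ANOVA table above:

s_p <- sqrt(0.003673)                              # pooled standard deviation
J <- 10                                            # replicates per lab
qtukey(0.95, nmeans = 7, df = 63) * s_p / sqrt(J)  # Tukey HSD cutoff: approx 0.082
qt(0.975, df = 63) * s_p * sqrt(2 / J)             # unadjusted t cutoff: approx 0.053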

model_res <- model%>%anova()
panderOptions('digits', 3)
# Tukey's HSD confidence intervals for all pairwise lab comparisons
(model%>%aov()%>%TukeyHSD())$lab%>%round(3)%>%pander()
  diff lwr upr p adj
Lab 2-Lab 1 -0.065 -0.148 0.018 0.217
Lab 3-Lab 1 -0.059 -0.142 0.024 0.323
Lab 4-Lab 1 -0.142 -0.225 -0.059 0
Lab 5-Lab 1 -0.105 -0.188 -0.022 0.005
Lab 6-Lab 1 -0.107 -0.19 -0.024 0.004
Lab 7-Lab 1 -0.064 -0.147 0.019 0.232
Lab 3-Lab 2 0.006 -0.077 0.089 1
Lab 4-Lab 2 -0.077 -0.16 0.006 0.083
Lab 5-Lab 2 -0.04 -0.123 0.043 0.758
Lab 6-Lab 2 -0.042 -0.125 0.041 0.714
Lab 7-Lab 2 0.001 -0.082 0.084 1
Lab 4-Lab 3 -0.083 -0.166 0 0.048
Lab 5-Lab 3 -0.046 -0.129 0.037 0.62
Lab 6-Lab 3 -0.048 -0.131 0.035 0.572
Lab 7-Lab 3 -0.005 -0.088 0.078 1
Lab 5-Lab 4 0.037 -0.046 0.12 0.818
Lab 6-Lab 4 0.035 -0.048 0.118 0.853
Lab 7-Lab 4 0.078 -0.005 0.161 0.076
Lab 6-Lab 5 -0.002 -0.085 0.081 1
Lab 7-Lab 5 0.041 -0.042 0.124 0.736
Lab 7-Lab 6 0.043 -0.04 0.126 0.691

Tukey’s method can be used to construct confidence intervals for the difference between two treatment means (as shown). Statistical software (including R) produces tables of pairwise comparisons of treatment means that illustrate both the ranking of the treatment means and the results of hypothesis tests to determine whether the differences are significant.

Note that reading the \(p\)-values in the table, we can see that at the Type I error rate of \(\alpha=0.05\), Labs 4, 5, and 6 are significantly different from Lab 1, and that Labs 4 and 3 are significantly different from each other.

Randomised Block Designs: Two-Way Classification

The single-factor completely randomised designs previously described are extensions of two-sample \(t\)-tests, using \(F\)-tests for inference and adjusting the Type I Error Rates to account for multiple comparisons. This model assumes that the only sources of observed variation in the responses are the treatment effects and random experimental error. Sometimes, however, it is obvious to the designer of an experiment that the subjects are not all homogeneous, so allocating treatments completely at random might induce bias or otherwise invalidate the results. Often this heterogeneity cannot be eliminated by the experimenter and is not of primary interest, but it must be accounted for. Controlling for this source of variation is done using a randomised block design, which is a direct extension of the matched pairs or paired difference designs shown in Week 9.

Blocking

In a randomised block design, if there are \(I\) treatments of interest, the experimenter chooses \(J\) blocks (\(J>I\)), each with \(I\) subjects, one for each treatment. This scheme isolates the block-to-block variability that might otherwise obscure the treatment effects.

A researcher wants to know the average yield for three different species of fruit trees, but there is variation in yield depending on the field that the trees are planted in. The researcher is not interested in the variation due to the fields, so the researcher picks five different fields, divides each into three plots, and plants one of the three species in each plot.
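In R, such a within-block randomisation might be sketched as follows (the species and field names are illustrative, not data from an actual experiment):

set.seed(3)
species <- c("Species A", "Species B", "Species C")
fields  <- paste("Field", 1:5)
# Each field (block) receives all three species, assigned to plots in random order
allocation <- sapply(fields, function(f) sample(species))
rownames(allocation) <- paste("Plot", 1:3)
allocation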

In this example, two factors account for the variation in the responses: the treatments and the blocks. Thus our model is now \[y_{ij} = \alpha_i+\beta_j+\epsilon_{ij}\] where \(\alpha_i\) is the mean effect for treatment \(i\) and \(\beta_j\) is the mean effect for block \(j\). We use ANOVA to explore the variation due to both of these factors and to random experimental error.

The ANOVA Table for Randomised Block Designs

To partition the variance, let \(y_{ij}\) be the response for treatment \(i\) in block \(j\). Then the total variation for the \(n=IJ\) responses can be partitioned into three parts: \[SST = SSB+SSTR+SSE\] where SSB, the sum of squares of the blocks, measures the variability between blocks \[SSB = I\sum_{j=1}^J(\bar{y}_{j}-\bar{y}_{..})^2\] with \[\bar{y}_j = \frac{1}{I}\sum_{i=1}^Iy_{ij},\] and \[SSE = \sum_{i=1}^I\sum_{j = 1}^J(y_{ij}-\bar{y}_i-\bar{y}_j+\bar{y}_{..})^2.\] In this case the general form of the ANOVA table is

Source      Sum of Squares   Degrees of Freedom   Mean Squares             F
Block       SSB              J-1                  MSB = SSB/(J-1)          MSB/MSE
Treatment   SSTR             I-1                  MSTR = SSTR/(I-1)        MSTR/MSE
Error       SSE              (I-1)(J-1)           MSE = SSE/((I-1)(J-1))
Total       SST              n-1

Note that the \(F\) statistics can be used as in the single-factor design to test the hypotheses: \[ H_0:\mbox{All treatment means are the same.}\: \mbox{ vs }\: H_A:\mbox{At least one treatment mean is different.} \] or \[ H_0:\mbox{All block means are the same}\:\mbox{ vs }\: H_A:\mbox{At least one block mean is different}. \] Also, we can use Tukey’s Honest Significant Differences (HSD) to rank and evaluate the differences between means for both treatments and blocks.

Example:

A researcher is interested in the differences in price between mobile phone providers. They want to compare prices from four different companies (A, B, C, D), but to account for differences between plans they create blocks based on plan type: low, medium, and high usage. The resulting data are:

Usage Level   A     B     C     D
Low           27    24    31    23
Medium        68    76    65    67
High          308   326   312   300


The results for the ANOVA without blocking on the usage level are:

Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Company 3 222 74.1 0.003 1
Residuals 8 189577 23697


The results for the ANOVA blocking by usage level are:

Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Usage Level 2 189335 94668 2352 0
Company 3 222 74.1 1.84 0.24
Residuals 6 242 40.2

Notice that when we block on usage level, the \(p\)-value for Company decreases from 1 to 0.24. Although the effect is still not statistically significant at the Type I error rate of \(\alpha = 0.05\), this is a substantial change.

Without Blocking

# Prices for four companies at three usage levels
library(tidyr)   # pivot_longer()
df <- tibble(`Usage_Level` = c("Low", "Medium","High"),A = c(27,68,308), B = c(24,76,326), C = c(31,65,312), D = c(23,67,300))

# Long format: one row per (usage level, company) price
df_l <- pivot_longer(df,cols = c("A","B","C","D"),names_to = "Company")
model <- lm(value~.,data = df_l)      # blocked model: Usage_Level + Company
model_nb <- lm(value~Company, df_l)   # unblocked model: Company only

panderOptions('missing', "")
panderOptions('round', 3)
panderOptions('plain.ascii', 'TRUE')
model_nb%>%aov()%>%pander()
Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Company 3 222 74.1 0.003 1
Residuals 8 189577 23697


And the results for Tukey’s HSD for Company without blocking by usage level are:

model_nb%>%aov()%>%TukeyHSD(which = "Company")%>%pander()
  • Company:

    diff lwr upr p adj
    B-A 7.67 -395 410 1
    C-A 1.67 -401 404 1
    D-A -4.33 -407 398 1
    C-B -6 -409 397 1
    D-B -12 -415 391 1
    D-C -6 -409 397 1


With Blocking

Now we add blocking on Usage_Level:

panderOptions('missing', "")
panderOptions('round', 3)
panderOptions('plain.ascii', 'TRUE')
model%>%aov()%>%pander()
Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Usage_Level 2 189335 94668 2352 0
Company 3 222 74.1 1.84 0.24
Residuals 6 242 40.2


And the results for Tukey’s HSD for Company with blocking by usage level are:

model%>%aov()%>%TukeyHSD(which = "Company")%>%pander()
  • Company:

    diff lwr upr p adj
    B-A 7.67 -10.3 25.6 0.502
    C-A 1.67 -16.3 19.6 0.987
    D-A -4.33 -22.3 13.6 0.836
    C-B -6 -23.9 11.9 0.671
    D-B -12 -29.9 5.93 0.196
    D-C -6 -23.9 11.9 0.671


And the results for Tukey’s HSD for usage level with blocking by usage level are:

model%>%aov()%>%TukeyHSD(which = "Usage_Level")%>%pander()
  • Usage_Level:

    diff lwr upr p adj
    Low-High -285 -299 -271 0
    Medium-High -242 -256 -229 0
    Medium-Low 42.8 29 56.5 0

Note that there appear to be significant differences between the costs for the various usage levels, but not between companies. So while there is clearly a difference in cost between usage levels (which is not of interest), even after accounting for usage level there are no significant differences in cost between the service providers.

Two-Factor Experiments

A randomised block design provides some structure to control for extraneous sources of variation that are not of interest. But in many cases a researcher may want to explore the effects of two different factors (both of interest) as well as their interaction (definitely of interest). Experiments of this kind are called two-factor factorial experiments.

Medical researchers often want to test the effects of various drugs, but in many cases, patients may be taking more than one medication. In this case, researchers need to test the effects of multiple medications and their interactions on patient outcomes using the model \[ y_{ijk} = \alpha_i+\beta_j+(\alpha\beta)_{ij}+\epsilon_{ijk} \] where \(\alpha_i\) is the mean effect for drug 1, \(\beta_j\) is the mean effect of drug 2, and \((\alpha\beta)_{ij}\) is the mean effect of the interaction between drug 1 and drug 2.
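In R this model corresponds to a formula containing both main effects and an interaction term. A minimal sketch with simulated data (the factor names, levels, and response below are entirely hypothetical):

set.seed(4)
patients <- expand.grid(drug1 = c("placebo", "active"),
                        drug2 = c("placebo", "active"),
                        rep   = 1:5)
patients$outcome <- rnorm(nrow(patients), mean = 100)   # made-up response
# outcome ~ drug1 * drug2 is shorthand for drug1 + drug2 + drug1:drug2
model <- lm(outcome ~ drug1 * drug2, data = patients)
anova(model)   # ANOVA table with both main effects and the interaction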

ANOVA for Two-Factor Experiments

In a two-factor experiment, factor A has \(I\) levels, factor B has \(J\) levels, and each combination of levels is replicated \(K\) times, making \(y_{ijk}\) observation \(k\) of the combination of levels \(i\) and \(j\). We can then partition the total sum of squares into four parts \[ SST = SSA + SSB + SSAB + SSE \] where:

  • SSA measures the variation among the means of factor A.

  • SSB measures the variation among the means of factor B.

  • SSAB measures the variation among the different combinations of levels for A and B.

  • SSE measures the variation among the observations within each combination of levels for A and B.

Given that there are \(I\) levels for factor A, \(J\) levels for factor B, and \(K\) replicates of each combination, the total number of observations is \(n = IJK\).
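As with the one-way case, this partition can be verified numerically. Below is a minimal sketch using made-up balanced data (all values illustrative), computing each sum of squares from the cell, row, column, and grand means:

set.seed(5)
I <- 2; J <- 3; K <- 4                      # levels of A, levels of B, replicates
y <- array(rnorm(I * J * K, mean = 5), dim = c(I, J, K))
ybar    <- mean(y)                          # grand mean
ybar_i  <- apply(y, 1, mean)                # factor A level means
ybar_j  <- apply(y, 2, mean)                # factor B level means
ybar_ij <- apply(y, c(1, 2), mean)          # cell means
SSA  <- J * K * sum((ybar_i - ybar)^2)
SSB  <- I * K * sum((ybar_j - ybar)^2)
SSAB <- K * sum((sweep(sweep(ybar_ij, 1, ybar_i), 2, ybar_j) + ybar)^2)
SSE  <- sum((y - c(ybar_ij))^2)             # within-cell variation
SST  <- sum((y - ybar)^2)
all.equal(SST, SSA + SSB + SSAB + SSE)      # TRUE: the partition holds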

The resulting ANOVA table for the analysis is

Source   Degrees of Freedom   Sum of Squares   Mean Squares               F
A        I-1                  SSA              MSA = SSA/(I-1)            MSA/MSE
B        J-1                  SSB              MSB = SSB/(J-1)            MSB/MSE
AB       (I-1)(J-1)           SSAB             MSAB = SSAB/((I-1)(J-1))   MSAB/MSE
Error    IJ(K-1)              SSE              MSE = SSE/(IJ(K-1))
Total    IJK-1                SST

Much of the analysis and inference carries over from the previous examples, extended to the two-factor case. The hypotheses of interest then become: \[ H_0: \mbox{Factor A treatment means are equal}\:\mbox{ vs }\: H_A: \mbox{At least one treatment mean is different} \] \[ H_0: \mbox{Factor B treatment means are equal}\:\mbox{ vs }\: H_A: \mbox{At least one treatment mean is different} \] \[ H_0: \mbox{There is no interaction between A and B}\quad\mbox{vs}\quad H_A: \mbox{Factors A and B interact} \]

We can test these hypotheses using a specified Type I Error Rate (typically \(\alpha = 0.05\)) and identifying the appropriate rejection region for their \(F\) statistics. Also, Tukey’s HSD can be used to assess the individual treatment differences.

Example:

The manager of a manufacturing plant knows that the output of a production line depends on two factors:

  • Which of two supervisors is in charge of the line

  • Which of three shifts (day, night, or swing) is working

The following data are collected:
Supervisor   Day   Swing   Night
1            571   480     470
1            610   474     430
1            625   540     450
2            480   625     630
2            516   600     680
2            465   581     661


The results for the ANOVA analysis in tabular form are:

Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Supervisor 1 19208 19208 26.7 2e-04
Shift 2 247 124 0.172 0.844
Supervisor:Shift 2 81127 40564 56.3 0
Residuals 12 8640 720

These results indicate that there doesn’t appear to be a significant difference between shifts, but that there is a significant difference between supervisors and, importantly, a significant interaction between supervisor and shift. This can be explored by looking at the differences in the mean outcome for each supervisor on each shift using Tukey’s HSD:

  diff lwr upr p adj
2:Day-1:Night 37 -36.6 111 0.562
1:Swing-1:Night 48 -25.6 122 0.309
2:Swing-1:Night 152 78.4 226 2e-04
1:Day-1:Night 152 78.4 226 2e-04
2:Night-1:Night 207 133 281 0
1:Swing-2:Day 11 -62.6 84.6 0.995
2:Swing-2:Day 115 41.4 189 0.0022
1:Day-2:Day 115 41.4 189 0.0022
2:Night-2:Day 170 96.4 244 1e-04
2:Swing-1:Swing 104 30.4 178 0.0049
1:Day-1:Swing 104 30.4 178 0.0049
2:Night-1:Swing 159 85.4 233 1e-04
1:Day-2:Swing 0 -73.6 73.6 1
2:Night-2:Swing 55 -18.6 129 0.195
2:Night-1:Day 55 -18.6 129 0.195

These results indicate that there is a significant difference between supervisors within each shift: while supervisor 2 appears to perform better overall (and on the swing and night shifts), supervisor 1 performs better on the day shift.

# Production line output by supervisor and shift
df <- tibble(Supervisor = c(1,1,1,2,2,2)%>%as.factor(), 
             Day = c(571,610,625,480,516,465), 
             Swing = c(480,474,540,625,600,581),
             Night = c(470,430,450,630,680,661))
# Long format: one row per (supervisor, shift) observation
df_l <- pivot_longer(df,cols = c(Day,Swing,Night),names_to = "Shift")
# Two-factor model with interaction
model <- lm(value~Supervisor+Shift+Supervisor:Shift,df_l)

pander(df)
Supervisor Day Swing Night
1 571 480 470
1 610 474 430
1 625 540 450
2 480 625 630
2 516 600 680
2 465 581 661
panderOptions('missing', "")
panderOptions('round', 4)
panderOptions('plain.ascii', 'TRUE')
model%>%aov()%>%pander()
Analysis of Variance Model
Df Sum Sq Mean Sq F value Pr(>F)
Supervisor 1 19208 19208 26.7 2e-04
Shift 2 247 124 0.172 0.844
Supervisor:Shift 2 81127 40564 56.3 0
Residuals 12 8640 720
tukey_model <- (model%>%aov()%>%TukeyHSD(which = 3,ordered = TRUE))$'Supervisor:Shift'
# Bold the within-shift comparisons of the two supervisors (rows 5, 8, and 10)
emphasize.strong.rows(c(5,8,10))
pander(tukey_model,plain.ascii = FALSE)
  diff lwr upr p adj
2:Day-1:Night 37 -36.6 111 0.562
1:Swing-1:Night 48 -25.6 122 0.309
2:Swing-1:Night 152 78.4 226 2e-04
1:Day-1:Night 152 78.4 226 2e-04
2:Night-1:Night 207 133 281 0
1:Swing-2:Day 11 -62.6 84.6 0.995
2:Swing-2:Day 115 41.4 189 0.0022
1:Day-2:Day 115 41.4 189 0.0022
2:Night-2:Day 170 96.4 244 1e-04
2:Swing-1:Swing 104 30.4 178 0.0049
1:Day-1:Swing 104 30.4 178 0.0049
2:Night-1:Swing 159 85.4 233 1e-04
1:Day-2:Swing 0 -73.6 73.6 1
2:Night-2:Swing 55 -18.6 129 0.195
2:Night-1:Day 55 -18.6 129 0.195

Worksheet Practical Question 1

The variation of the strength (colouring power) of a dyestuff from one manufacturing batch to another was studied (Davies, 1960). Strength was measured by dyeing a square of cloth with a standard concentration of dyestuff and visually comparing the result with a standard.


The result was numerically scored as the percentage strength of the dyestuff. Large samples were taken from six batches, and from each batch six sub-samples were taken. The 36 sub-samples were submitted to the laboratory in random order for testing as described above. Analyse the data using a one-way ANOVA.


# `dye` is assumed to be loaded by the tutorial, one column per batch
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
dye <- pivot_longer(dye,cols = names(dye))
names(dye) <- c("Batch","Strength")
model<-lm(Strength~Batch, data = dye)

anova(model)%>%
  kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")
model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(1)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

Worksheet Practical Question 2

The concentrations (in nanograms per millilitre) of plasma epinephrine were measured for ten dogs under isoflurane, halothane, and cyclopropane anaesthesia (Perry et al., 1974).

Analyse these data using a one-factor ANOVA on the anaesthetic.


# `dog` is assumed to be loaded by the tutorial; pivot all but the
# first column to long format, then rename the columns
dog <- dog%>%
        pivot_longer(cols = names(dog[-1]))
names(dog) <- c("anesthesia", "dog", "epinephrine")
model<-lm(epinephrine~anesthesia, data = dog)

  anova(model)%>%
  kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")
Now, perform the same ANOVA, but block on the variable “dog”
model<-lm(epinephrine~anesthesia+dog, data = dog)

  anova(model)%>%
  kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")
model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(1)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

Comments? Thoughts?

Worksheet Practical Question 3

The following data describe the outcome of an experiment where the survival time of animals in hours is measured for animals receiving one of three poisons and one of four treatments (Box and Cox 1964).

Analyse these data using a one-factor ANOVA for Poison, a two-factor ANOVA for Poison and Treatment (without interaction), and a two-factor ANOVA with an interaction between Treatment and Poison.

# `poison` is assumed to be loaded by the tutorial
head(poison)
model <- lm(Survtime ~ Poison, poison)

anova(model)%>%
    kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
    kable_styling(position = "center")
model <- lm(Survtime ~ Poison + Treatment, poison)

anova(model)%>%
    kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
    kable_styling(position = "center")
model <- lm(Survtime ~ Poison + Treatment + Poison:Treatment , poison)

anova(model)%>%
    kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
    kable_styling(position = "center")
model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(1)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(2)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Poison Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

Additional Questions for Practice

  1. Experimenters infect a group of rats with worms. Eight days later, the experimenters select five rats at random and treat them with carbon tetrachloride. Two days later researchers count the number of worms in each of the rats. This experiment is repeated four times.


Is there reason to suspect that there is a significant difference between replications?

GroupI   GroupII   GroupIII   GroupIV
279      378       172        381
338      275       335        346
334      412       335        340
198      265       282        471
303      286       250        318
# `rats` is assumed to be loaded by the tutorial, one column per replication group
rats <- rats%>%pivot_longer(cols = names(rats))

model <- lm(value~name, rats)

model%>%
  anova()%>%
    kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
    kable_styling(position = "center")

model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(1)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")
  2. Researchers studying the effects of luteinizing hormone (LH) conducted an experiment where male rats were either kept under constant light or were exposed to 14 hours of light followed by ten hours of darkness, and were given varying doses of luteinizing release hormone (LRH). The researchers recorded the rats’ LH levels.


Analyse the data to determine the effect of the light regime and LRH dosage on the production of LH in male rats.

# `rats_lh` is assumed to be loaded by the tutorial
head(rats_lh)
# Recode: dose as a factor; pivot the two light-regime columns to long format
colnames(rats_lh) <- c("dose","normal","constant")
rats_lh$dose<-as.factor(rats_lh$dose)

rats_lh<-pivot_longer(rats_lh,cols = c(normal,constant))
colnames(rats_lh) <- c("dose","light","response")

# Two-factor model with interaction between dose and light regime
model <- lm(response~dose+light+dose:light, rats_lh)

model%>%
  anova()%>%
    kable(digits = 3, "html",table.attr = "style='width:80%;'",align = "c")%>%
    kable_styling(position = "center")

model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(1)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

model%>%
  aov()%>%
  TukeyHSD()%>%
  magrittr::extract2(2)%>%
  kable(digits =  3,caption="Tukey Confidence Intervals for Difference in Treatment Means",booktabs = TRUE,col.names = c("Difference of Means","LCL","UCL",expression(p-value)),"html",table.attr = "style='width:80%;'",align = "c")%>%
  kable_styling(position = "center")

MXB107 Worksheet 10 - Analysis of Variance