Tutorial 6 - Linear Regression 2
Introduction
Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.
NOTE: If following along on your own machine in RStudio you will need to make sure the following packages are loaded (and, if not already, installed).
library(tidyverse)
library(gapminder)
library(broom)
In this tutorial, we will discuss diagnostic plots for linear
regression. Before starting this tutorial we must re-fit our example
linear models from the previous tutorial. Recall that there were two: a
simple linear regression lifeExp ~ log(gdpPercap), and a
polynomial model fitting log(gdpPercap) as a quadratic
effect, using variables from the gapminder dataset.
# the simple linear model
linear_model <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
# the polynomial model
polynomial_model <- lm(
lifeExp ~ poly(log(gdpPercap), 2, raw = TRUE),
# raw = TRUE keeps the raw polynomial terms, rather than
# the orthogonal (standardised) polynomials R uses by default
data = gapminder
)
Extracting Residuals for Linear Models
Before we can produce diagnostic plots, we need to extract the
diagnostic information from the fitted model. This is done using a
ggplot2 function called fortify().
The fortify() function
Calling fortify() on a fitted model returns a data frame containing the
predicted values (.fitted) as well as the residuals (.resid) and
standardised residuals (.stdresid). There are other bits of information
there, such as the leverage (.hat) of each point, but these
will not be needed at the level of these tutorials.
fort_df <- fortify(linear_model)
You can see an excerpt of the fortified data below.
head(fort_df)
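The broom package loaded at the start of this tutorial offers an equivalent alternative: its augment() function returns the same kind of diagnostic columns as a tibble. A minimal sketch, using only the linear_model object fitted above:

```r
library(broom)

# augment() returns the model's data plus diagnostic columns
# (.fitted, .resid, .std.resid, .hat, etc.) as a tibble
aug_df <- augment(linear_model)

head(aug_df)
```

Note one naming difference: broom calls the standardised residual column .std.resid, whereas fortify() calls it .stdresid.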
Model Diagnostics
Regression diagnostics allow you to check whether your fitted model satisfies the assumptions of linear regression.
Residuals vs Fitted
The residuals vs fitted plot is a diagnostic that can be used to assist checking two assumptions:
- The zero mean assumption: the mean of the residuals must be zero at all points along this graph.
- The homogeneity of variances (homoscedasticity) assumption: the variance (spread) of the residuals must be approximately equal across the whole plot.
The residuals vs fitted plot can be produced using the fortified data frame from the previous section.
ggplot(data = fort_df, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) + # this makes it easier to check the zero mean assumption
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
Note that both the x and y axes in this plot have the same units as the y variable in your linear regression model. However, we do not usually put the units on the axis labels of this graph.
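The same diagnostic can be applied to the polynomial model fitted at the start of this tutorial; comparing the smoother here with the one above is one way to check whether the quadratic term accounts for any curvature left in the simple model's residuals.

```r
# fortify the polynomial model and draw its residuals vs fitted plot
poly_fort <- fortify(polynomial_model)

ggplot(data = poly_fort, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted values", y = "Residuals") +
  theme_bw()
```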
The Quantile-Quantile (QQ) Plot
The normal quantile-quantile plot, also called a qq-plot, allows us to check whether the residuals from a linear regression model are normally distributed.
ggplot(data = fort_df, aes(sample = .stdresid)) +
stat_qq() + # use stat_qq to plot the qq-plot
geom_abline(slope = 1, intercept = 0) + # draws the one-to-one comparison line
theme_bw()
The residuals are considered normally distributed if the points on this plot fall along the straight line with an intercept of 0 and a slope of 1.
It is good practice to update the axis labels to make them more informative, like so.
ggplot(data = fort_df, aes(sample = .stdresid)) +
stat_qq() + # use stat_qq to plot the qq-plot
geom_abline(slope = 1, intercept = 0) + # draws the one-to-one comparison line
labs(x = "q values from standard normal",
y = "q values from standardised residuals") +
theme_bw()
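As an aside, ggplot2 also provides stat_qq_line(), which draws a reference line through the first and third quartiles of the sample rather than the fixed one-to-one line; for standardised residuals the two lines are usually very close.

```r
ggplot(data = fort_df, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() + # reference line through the quartiles
  theme_bw()
```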
Assessing Goodness-of-Fit
Often when fitting linear models, we do not just try one and leave it at that. There is usually some comparison between many nested models involved.
Comparing Nested Models with an \(F\)-Test: the anova()
function
The anova() function can be used to perform an \(F\)-test with the null
hypothesis that some linear model explains the same variability (for the
population, not the data) as a larger linear model in which it is nested
(i.e., the smaller model sets some of the larger model's parameters to zero).
Below, we compare the model \[{\sf mpg} =
\beta_0 + \beta_1 \cdot {\sf wt} + \beta_2\cdot{\sf drat} + \beta_3\cdot
{\sf wt}\cdot {\sf drat}\] against the model \[{\sf mpg} = \beta_0 + \beta_1 \cdot {\sf
wt},\] using variables from the mtcars dataset.
model1 <- lm(data = mtcars, mpg ~ wt * drat)
model2 <- lm(data = mtcars, mpg ~ wt)
anova(model2, model1)
It is important to note that the two models must be nested for the interpretation of the \(p\)-value to match the description of the alternative hypothesis above.
An advantage of using an \(F\)-test is that it can compare models with a large difference in the number of terms. This can help avoid errors due to multiple comparisons: even if the null hypothesis is true, 5% of the time one will reject it at the \(\alpha=0.05\) level, so one wants to avoid doing too many tests if possible (recall the jellybean example cartoon).
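The two gapminder models fitted at the start of this tutorial are themselves nested: the simple linear regression is the quadratic model with the coefficient on the squared term set to zero, so they can be compared with the same kind of \(F\)-test.

```r
# F-test: does the quadratic term explain additional variability?
anova(linear_model, polynomial_model)
```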
Combining plots into one figure: The patchwork
package
The patchwork package makes it extremely simple to combine
multiple plots made with ggplot() into the same figure with
the layout you desire. Using it is easy: save each plot you create as
an object (below it is done with three plots, called p1,
p2, and p3), and then, once the
patchwork package is loaded, you can display the plots
together in the desired format by combining the plot objects with
+ and / syntax.
For example, p1+p2+p3 would then display all three plots
next to each other, whereas p1/(p2+p3) displays the first
plot on the first row, and then the other two on the second row.
The code below fits a linear model using the iris data,
and visualises the points and the fitted line, as well as two regression
diagnostics, all within one figure.
# plot the scatterplot of the data
p1 <- iris %>% # here we use the pipe operator to push the iris data into the next bit of code
ggplot(aes(x = Petal.Width, y = Petal.Length)) +
geom_point(aes(color = Species), alpha = 0.7) +
geom_smooth(method = "lm") +
theme_bw()
# fit the model and extract the residuals
model <- lm(data = iris, Petal.Length ~ Petal.Width)
model.fort <- model %>%
fortify()
# create the resid v fitted plot
p2 <- ggplot(data = model.fort, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) + # this makes it easier to check the zero mean assumption
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
# Plot the normal distribution
p3 <- ggplot(
data = model.fort,
aes(sample = .stdresid)
) +
stat_qq() +
geom_abline() + # geom_abline actually defaults to int = 0 and slope = 1
labs(x = "q values from std. normal",
y = "q values from std. resid.") +
theme_bw()
# stitch it all together
library(patchwork)
p1 / (p2 + p3)
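patchwork can also add an overall title and tag each panel via plot_annotation(); the title text below is just an illustration.

```r
p1 / (p2 + p3) +
  plot_annotation(
    title = "Petal length vs width: fit and diagnostics",
    tag_levels = "A" # labels the panels A, B, C
  )
```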