Tutorial 6 - Linear Regression 2
Introduction
Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.
NOTE: If following along on your own machine in RStudio you will need to make sure the following packages are loaded (and, if not already, installed).
library(tidyverse)
library(gapminder)
library(broom)
In this tutorial, we will discuss diagnostic plots for linear
regression. Before starting this tutorial we must re-fit our example
linear models from the previous tutorial. Recall that there were two: a
simple linear regression lifeExp ~ log(gdpPercap), and a
polynomial model fitting log(gdpPercap) as a quadratic
effect, using variables from the gapminder dataset.
# the simple linear model
linear_model <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
# the polynomial model
polynomial_model <- lm(
lifeExp ~ poly(log(gdpPercap), 2, raw = TRUE),
# raw = TRUE keeps the raw polynomial terms, rather than
# the orthogonal (standardised) polynomials R uses by default
data = gapminder
)
Extracting Residuals for Linear Models
Before we can produce diagnostic plots, we need to extract the
diagnostic information from the fitted model. This is done using a
ggplot2 function called fortify().
The fortify() function
Calling fortify() on a fitted model returns a data frame containing the
predicted values (.fitted) as well as the residuals (.resid) and
standardised residuals (.stdresid). There are other bits of information
there, such as the leverage (.hat) of each point, but these
will not be needed at the level of these tutorials.
fort_df <- fortify(linear_model)
You can see an excerpt of the fortified data below.
head(fort_df)
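The broom package loaded at the start of this tutorial offers an equivalent alternative: its augment() function returns the same kind of diagnostic columns as a tibble. A minimal sketch, using only the linear_model object fitted above:

```r
library(broom)

# augment() returns the model's data plus diagnostic columns
# (.fitted, .resid, .std.resid, .hat, etc.) as a tibble
aug_df <- augment(linear_model)

head(aug_df)
```

Note one naming difference: broom calls the standardised residual column .std.resid, whereas fortify() calls it .stdresid.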
Model Diagnostics
Regression diagnostics allow you to check whether your fitted model satisfies the assumptions of linear regression.
Residuals vs Fitted
The residuals vs fitted plot is a diagnostic that can be used to assist checking two assumptions:
- The zero mean assumption: the mean of the residuals must be zero at all points along this graph.
- The homogeneity of variances (homoscedasticity) assumption: the variance (spread) of the residuals must be approximately equal across the whole plot.
The residuals vs fitted plot can be produced using the fortified data frame from the previous section.
ggplot(data = fort_df, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) + # this makes it easier to check the zero mean assumption
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
Note that both the x and y axes in this plot have the same units as the y variable in your linear regression model. However, we do not usually put the units on the axis labels of this graph.
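The same diagnostic can be applied to the polynomial model fitted at the start of this tutorial; comparing the smoother here with the one above is one way to check whether the quadratic term accounts for any curvature left in the simple model's residuals.

```r
# fortify the polynomial model and draw its residuals vs fitted plot
poly_fort <- fortify(polynomial_model)

ggplot(data = poly_fort, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted values", y = "Residuals") +
  theme_bw()
```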
The Quantile-Quantile (QQ) Plot
The normal quantile-quantile plot, also called a qq-plot, allows us to check whether the residuals from a linear regression model are normally distributed.
ggplot(data = fort_df, aes(sample = .stdresid)) +
stat_qq() + # use stat_qq to plot the qq-plot
geom_abline(slope = 1, intercept = 0) + # draws the one-to-one comparison line
theme_bw()
The residuals are considered normally distributed if the points on this plot fall along the straight line with an intercept of 0 and a slope of 1.
It is good practice to update the axis labels to make them more informative, like so.
ggplot(data = fort_df, aes(sample = .stdresid)) +
stat_qq() + # use stat_qq to plot the qq-plot
geom_abline(slope = 1, intercept = 0) + # draws the one-to-one comparison line
labs(x = "q values from standard normal",
y = "q values from standardised residuals") +
theme_bw()
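As an aside, ggplot2 also provides stat_qq_line(), which draws a reference line through the first and third quartiles of the sample rather than the fixed one-to-one line; for standardised residuals the two lines are usually very close.

```r
ggplot(data = fort_df, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() + # reference line through the quartiles
  theme_bw()
```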
Assessing Goodness-of-Fit
Often when fitting linear models, we do not just try one and leave it at that. There is usually some comparison between many nested models involved.
Comparing Nested Models with an \(F\)-Test: the anova()
function
The anova() function can be used to perform an \(F\)-test with the null
hypothesis that some linear model explains the same variability (for the
population, not the data) as a larger linear model in which it is nested
(i.e., the smaller model sets some of the larger model's parameters to zero).
Below, we compare the model \[{\sf mpg} =
\beta_0 + \beta_1 \cdot {\sf wt} + \beta_2\cdot{\sf drat} + \beta_3\cdot
{\sf wt}\cdot {\sf drat}\] against the model \[{\sf mpg} = \beta_0 + \beta_1 \cdot {\sf
wt},\] using variables from the mtcars dataset.
model1 <- lm(data = mtcars, mpg ~ wt * drat)
model2 <- lm(data = mtcars, mpg ~ wt)
anova(model2, model1)
It is important to note that the two models must be nested for the interpretation of the \(p\)-value to match the description of the alternative hypothesis above.
An advantage of using an \(F\)-test is that it can compare models with a large difference in the number of terms. This can help avoid errors due to multiple comparisons: even if the null hypothesis is true, 5% of the time one will reject it at the \(\alpha=0.05\) level, so one wants to avoid doing too many tests if possible (recall the jellybean example cartoon).
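The two gapminder models fitted at the start of this tutorial are themselves nested: the simple linear regression is the quadratic model with the coefficient on the squared term set to zero, so they can be compared with the same kind of \(F\)-test.

```r
# F-test: does the quadratic term explain additional variability?
anova(linear_model, polynomial_model)
```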
Combining plots into one figure: The patchwork
package
The patchwork package makes it extremely simple to combine
multiple plots made with ggplot() into the same figure with
the layout you desire. Using it is easy: save each plot you create as
an object (below it is done with three plots, called p1,
p2, and p3), and then, once the
patchwork package is loaded, you can display the plots
together in the desired format by combining the plot objects with
+ and / syntax.
For example, p1+p2+p3 would then display all three plots
next to each other, whereas p1/(p2+p3) displays the first
plot on the first row, and then the other two on the second row.
The code below fits a linear model using the iris data,
and visualises the points and the fitted line, as well as two regression
diagnostics, all within one figure.
# plot the scatterplot of the data
p1 <- iris %>% # here we use the pipe operator to push the iris data into the next bit of code
ggplot(aes(x = Petal.Width, y = Petal.Length)) +
geom_point(aes(color = Species), alpha = 0.7) +
geom_smooth(method = "lm") +
theme_bw()
# fit the model and extract the residuals
model <- lm(data = iris, Petal.Length ~ Petal.Width)
model.fort <- model %>%
fortify()
# create the resid v fitted plot
p2 <- ggplot(data = model.fort, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.3) +
geom_smooth(se = FALSE) + # this makes it easier to check the zero mean assumption
labs(x = "Fitted values", y = "Residuals") +
theme_bw()
# Plot the normal distribution
p3 <- ggplot(
data = model.fort,
aes(sample = .stdresid)
) +
stat_qq() +
geom_abline() + # geom_abline actually defaults to int = 0 and slope = 1
labs(x = "q values from std. normal",
y = "q values from std. resid.") +
theme_bw()
# stitch it all together
library(patchwork)
p1 / (p2 + p3)
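patchwork can also add an overall title and tag each panel via plot_annotation(); the title text below is just an illustration.

```r
p1 / (p2 + p3) +
  plot_annotation(
    title = "Petal length vs width: fit and diagnostics",
    tag_levels = "A" # labels the panels A, B, C
  )
```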