Tutorial 5 - Linear Regression 1
Introduction
Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.
NOTE: If following along on your own machine in RStudio you will need to make sure the following packages are loaded (and, if not already, installed).
library(tidyverse)
library(gapminder)
library(GGally)
Here we will also introduce the GGally package. Make sure that this is installed and load this along with the others.
# install.packages("GGally")
library(GGally)
Linear regression refers to the process of estimating the parameters of a linear relationship between two or more variables. It is a cornerstone of statistical analysis in science.
This section exhibits the R commands that are used for fitting and checking linear models.
Suppose you have a data set with \(n\) observations, one response variable, and one explanatory variable. Let \(y_i\) denote the \(i\)-th value of a response variable and let \(x_i\) be the \(i\)-th value of the explanatory variable, for \(i = 1, ..., n\). Then, we simply write the linear model as, \[\begin{align*} y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i,\\ \varepsilon_i &\sim \text{Normal}(0, \sigma^2), \end{align*}\] where \(\varepsilon_i\) is the \(i\)-th residual or error, which is assumed to be normally distributed with mean 0 and variance \(\sigma^2\). The error term is included to account for the fact that a single straight line specified in the equation above is unlikely to perfectly predict the response variable.
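To make this concrete, here is a minimal simulation sketch: it generates data from the model above using made-up parameter values (\(\beta_0 = 1\), \(\beta_1 = 0.5\), \(\sigma = 2\)), so the points scatter randomly around a straight line.
set.seed(1) # for reproducibility
n <- 100
x <- runif(n, min = 0, max = 10)      # explanatory variable
epsilon <- rnorm(n, mean = 0, sd = 2) # errors: Normal(0, sigma^2) with sigma = 2
y <- 1 + 0.5 * x + epsilon            # beta_0 = 1, beta_1 = 0.5
plot(x, y)                            # the points scatter around a straight line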
Exploratory Analysis with ggpairs()
Before fitting any linear regression models, it is good practice to plot the relationships between the variables in your data set. This helps to confirm that
- Relationships between variables of interest in your data set do, in fact, exist; and
- The relationships between variables of interest in your data set are, in fact, linear.
The simplest way to do this for your data set is to use the ggpairs() function from the GGally package. This is most useful when applied to the quantitative variables (either continuous or count variables) in your data set. Note that it may take a while to run this command for data sets with many columns.
library(gapminder)
library(GGally)
# create a vector of variables to plot in your pairs plot
quant_vars <- c("year", "lifeExp", "pop", "gdpPercap")
# create the pairs plot
ggpairs(
  gapminder,           # the data set
  columns = quant_vars # the subset of columns to make the plot with
) +
  theme_bw() # we can also add themes to these plots to make them nicer to read
Other Exploratory Analysis
Along with the ggpairs() plot, it is often also worthwhile to plot the data you wish to model as a scatterplot. This way you can see whether the data you wish to model actually look linear or not. We have looked at plots of the gapminder data before, in tutorial 4, but here is a reminder.
# let's take a look at the plot of gdp by life expectancy
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(x = "GDP per capita (USD)", y = "Life expectancy (years)") +
  theme_bw()
This data doesn’t look linear at all. If you recall, in tutorial 4, we transformed the data using a log transformation. Note that, in the plots for the “Transformations of variables” section of tutorial 4 on visualisation, we used log10() as our transformation. However, the natural logarithm is more common so that is what we use here.
ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(
    x = expression(log[e]~GDP~per~capita~(log[e](USD))),
    y = "Life expectancy (years)"
  ) +
  theme_bw()
Now the data appears to follow a more linear trend, and so it is this natural log data that we will be using in further analyses here.
Computing Correlation and Correlation Matrices: cor()
The sample correlation between two continuous variables in a data frame can be computed using the cor() function. Let’s look at an example using the gapminder data set.
cor(gapminder$lifeExp, log(gapminder$gdpPercap))
The cor() function can also be used to compute correlation matrices. However, one must make sure that all variables are numerical. If there are any categorical variables, it will return an error. The issue can be resolved by subsetting the columns corresponding to all (or some) of the continuous variables.
cts_vars <- c("year", "lifeExp", "pop", "gdpPercap")
# gapminder[,cts_vars] is a reduced form of the dataframe gapminder with
# only the columns in cts_vars
cor(gapminder[,cts_vars])
Correlation Heatmaps: ggcorrplot()
A visually appealing and informative way of visualizing correlation between continuous variables is through a heatmap.
#install.packages("ggcorrplot")
library(ggcorrplot)
v <- c("year", "lifeExp", "pop", "gdpPercap") # variables to display in heatmap
ggcorrplot(cor(gapminder[, v]),
           method = "circle",
           colors = c("red", "white", "blue"))
A Word of Caution
CORRELATION DOES NOT EQUATE TO CAUSATION! Just because two variables may be highly positively or negatively correlated, it does not mean that they have any effect on one another. This is an easy trap to fall into. If you are interested in seeing some ludicrous examples of correlated variables, check out Spurious Correlations.
Linear Regression in R with lm()
The lm() command fits linear models. The command is spelled with a lower-case “L” and a lower-case “M”. The following subsections illustrate how to use the lm() command.
The lm() function has its own syntax that is unlike most code you have seen in this tutorial set (but see facet_wrap() and facet_grid() in tutorial 4). This is because the lm() function uses formula objects to specify relationships between variables. In general, a formula looks like y ~ x, where y is a response variable and x is an explanatory variable. The squiggly line (~) is called a tilde.
In general, lm() commands are written like this:
model <- lm(y ~ x, data = DATASET)
In the code above, y, x and DATASET need to be replaced with appropriate object names in your own code.
The following example uses the gapminder dataset and fits a relationship between the response variable, life expectancy (in years), and the explanatory variable, the natural logarithm of GDP per capita.
linear_model <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
You may notice there is no output for the above code, but this is simply because we have stored the output in the object called linear_model. If you are following along in RStudio, you will see this object appear in the working environment. This, along with the lack of error outputs, tells you that the code has at least run successfully.
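For a quick first look at the fitted model, you can also print the object itself; this displays the call that created the model along with the estimated coefficients.
# printing the model object shows the call and the coefficient estimates
linear_model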
Inspecting the lm() output
Having fitted a linear model, it is important to know the estimates of the parameters, \(\hat\beta_0\) and \(\hat\beta_1\) (and the corresponding hypothesis tests). The “hats” on the parameters indicate these are estimated values. We can access this information using functions from the broom package, or by using the summary() function.
Each linear model is also associated with a set of “model-fit statistics”. These summarise, among other things, how well your linear model fits your data. Functions from the broom package can be used to extract these, as can the summary() function. See below for instructions on how to use these.
Using the broom package
The broom package has two main functions which are used to extract the information for the estimates, and the information for the model-fit statistics.
The function tidy() prints a table of the estimates, and the function glance() gives us the model-fit statistics. Each of these functions takes a model object as input. Our model object was created above, and we called it linear_model. Let’s take a look at it.
library(broom)
# pull out the model estimates
tidy(linear_model)
# and the model fit statistics
glance(linear_model)
As you can see from this output, the broom functions print the results in a clear, easy-to-read tabulated format. These are the preferred functions for accessing this information. However, if you wish to see all the information with a single function call, you can do so using summary().
Using summary()
The summary() function works in the same way as the broom functions in that it takes a model object as the input. This function, however, does not return the information in the same easy-to-read tabulated format, and you will need to sift through some output you may not want or need.
summary(linear_model)
In this output, you can find the coefficient information under the heading Coefficients:, and the information you might otherwise find using glance() in the final paragraph of the output. The summary also returns a five-number summary (the information a boxplot displays) of the model residuals. The residuals, and how to handle these, will be discussed in Tutorial 6 - Linear Regression 2.
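If you only want specific pieces of this information, base R also provides extractor functions. For example, coef() returns just the estimated coefficients as a named vector, and individual model-fit statistics can be pulled out of the summary object by name.
# extract just the estimated coefficients
coef(linear_model)
# extract the R-squared value from the summary object
summary(linear_model)$r.squared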
Polynomial Models: using poly() inside of lm()
Polynomial models allow the modeling of curvature in the context of a linear model. As a special case of the polynomial model, here, we consider quadratic models, which are polynomial models of degree 2 (the highest power applied to the \(x\) variable). A quadratic model is written as,
\[\begin{align*} y_i &= \beta_0 + \beta_1x_i + \beta_2x_i^2 + \varepsilon_i\\ \varepsilon_i &\sim \text{Normal}(0, \sigma^2). \end{align*}\]
These models are considered linear models because they are linear in the parameters \(\beta_0\), \(\beta_1\) and \(\beta_2\), even if their predictions do not fall in a straight line. In R, they can be fitted using a combination of the lm() and poly() functions.
polynomial_model <- lm(
  lifeExp ~ poly(log(gdpPercap), 2, raw = TRUE),
  # raw = TRUE uses the raw powers of x rather than orthogonal polynomials
  data = gapminder
)
The tidy() and glance() functions work the same way on a polynomial model as they would for a linear model.
# access the estimates for the model
tidy(polynomial_model)
# access the model-fit statistics
glance(polynomial_model)
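As a quick illustration of how these model-fit statistics can be used, glance() returns a tibble, so individual statistics can be extracted by name and compared across models, for example the adjusted R-squared values of the linear and polynomial fits.
# compare the adjusted R-squared values of the two models
glance(linear_model)$adj.r.squared
glance(polynomial_model)$adj.r.squared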
Confidence Intervals
Every hypothesis test has an associated confidence interval, and the hypothesis tests on the model parameter estimates are no exception. However, these intervals are not displayed by default by the tidy() or summary() functions. They can be displayed as part of the tidy() output by adding some more inputs to the function.
These arguments are conf.int and conf.level. conf.int should be a logical (TRUE/FALSE) input, telling R whether or not to include the intervals, and conf.level specifies the confidence level, which defaults to 95%.
Let’s take a look at the confidence intervals for the first linear model we created.
tidy(linear_model, conf.int = TRUE, conf.level = 0.95)
# try yourself to display the 80% confidence intervals for the same model
Exercise
It’s your turn! Write some code below to generate the confidence intervals at both 90% and 95% for the polynomial model we created earlier.
Prediction: using the predict() function on a model
Predictions from a linear model can be obtained via the function predict(). Imagine we want to predict the life expectancy (in years) of a country where the GDP per capita is 60,000 USD. Let’s use the polynomial model (though this process is the same for a simple linear model).
newdata <- data.frame(gdpPercap = 60000)
# need to set up a dataframe
# the dataframe must have the same column name as the x
# variable(s) in the lm command
predictions <- predict(polynomial_model, newdata)
predictions
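Note that predict() is not limited to a single value: if the data frame you pass in has several rows, one prediction is returned per row. The GDP per capita values below are chosen purely for illustration.
# predict life expectancy for several GDP-per-capita values at once
newdata_multi <- data.frame(gdpPercap = c(1000, 10000, 60000))
predict(polynomial_model, newdata_multi)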
Exercise
In the code window below, use the original linear model to predict the life expectancy (in years) of a country where the GDP per capita is 75,000 USD.
Confidence and prediction intervals with predict()
The predict() function can also output confidence intervals and prediction intervals.
newdata <- data.frame(gdpPercap = 60000)
# need to set up a dataframe
# the dataframe must have the same column name as the x
# variable(s) in the lm command
predictions <- predict(
  polynomial_model,
  newdata,
  interval = "confidence" # set to "confidence" for 95% CI
)
predictions
predictions <- predict(
  polynomial_model,
  newdata,
  interval = "prediction" # set to "prediction" for 95% PI
)
predictions
Exercise
Use predict() to find the 95% confidence and prediction intervals from the original linear model for a country where the GDP per capita is 75,000 USD.
Plotting Prediction Intervals
Plotting prediction intervals can be an involved process. First, you must produce a data frame of predictions and the prediction intervals via the predict() function. In the example below, life expectancies of countries with GDPs per capita between 100 and 90,000 USD are predicted. We also obtain the 95% prediction intervals.
newdata <- data.frame(
  gdpPercap = 100:90000 # this : creates a sequence incremented by 1 each step
)
plot_preds <- predict(
  polynomial_model,
  newdata,
  interval = "prediction" # set to "prediction" for 95% PI
)
head(plot_preds)
Second, the data frame of predictions must be joined to the data frame of x values (in this case, newdata) using the bind_cols() command. This command glues together the columns of two data frames with the same number of rows. In order for this to work with the correct headings, both objects entered as input must be dataframes.
# check the class of plot-preds
class(plot_preds)
# it looks like predict has output a matrix object
# convert this to a dataframe
plot_preds <- data.frame(plot_preds)
# then we can join the data frames together using bind_cols
plot_preds_joined <- bind_cols(newdata, plot_preds)
head(plot_preds_joined)
Finally, the prediction intervals can be plotted on top of the existing data. However, the ggplot code must be structured a little differently from usual: the ggplot() call itself is left empty, and each layer is given its own data and aes() arguments. See below:
ggplot() + # ggplot command is empty
  # Plot the data
  geom_point(
    alpha = 0.3,
    data = gapminder,
    aes(x = log(gdpPercap), y = lifeExp)
  ) +
  # Plot the predictions
  geom_line(
    col = "red",
    data = plot_preds_joined,
    aes(x = log(gdpPercap), y = fit)
  ) +
  # Plot the prediction interval
  geom_ribbon(
    col = NA,
    fill = "red",
    alpha = 0.2,
    data = plot_preds_joined,
    aes(x = log(gdpPercap), ymin = lwr, ymax = upr)
  ) +
  labs(
    x = expression(log~GDP~per~capita~(log(USD))),
    y = "Life expectancy (years)"
  ) +
  ylim(0, 100) + # set y limits
  theme_bw()
Test Your Knowledge
You can use the R window below to help find the answers to some quiz questions.