
Introduction
Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.
This tutorial primarily teaches you how to use the ggplot2 package to create meaningful visualisations of data. It also uses some functions from dplyr (covered in the previous tutorial) and some data from the gapminder package. ggplot2 and dplyr both form part of the larger tidyverse collection of packages, so to simplify the code, we can load them with a single line.
Remember: you only need to install a package if, when you try to load it, R returns an error stating that the package does not exist.
#install.packages("tidyverse")
library(tidyverse)
#install.packages("gapminder")
library(gapminder)
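The install-then-load pattern above can also be automated. The following is a minimal sketch, not part of the original tutorial, that installs each package only when R cannot find it. It assumes an internet connection and a configured CRAN mirror in case anything needs installing.

```r
# Packages used in this tutorial
pkgs <- c("tidyverse", "gapminder")

# Install a package only if R cannot find it
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the packages as before
library(tidyverse)
library(gapminder)
```

requireNamespace() checks whether a package is available without attaching it, which avoids reinstalling packages you already have.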
The ggplot2 package
The most basic use of the ggplot2 package is:
ggplot()
Running the command above will generate an empty plot.
If you have data, you can add it to the plot at this stage. In what follows, the Loblolly data set is used to demonstrate ideas in ggplot2.
ggplot(
  data = Loblolly, # this MUST be a data frame
  aes(
    x = age,    # what column from your data do you want on the x-axis?
    y = height  # what column from your data do you want on the y-axis?
  )
)
There is nothing on this plot yet. However, notice that the x-axis and y-axis have been scaled to the limits of the age and height columns from the Loblolly data set.
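To see where these axis limits come from, it can help to inspect the data directly. The sketch below is an illustrative addition that uses only base R; the object names age_range and height_range are made up for this example. Note that, by default, ggplot2 pads the axes slightly beyond these limits.

```r
# Loblolly ships with R (datasets package), so no extra packages are needed
str(Loblolly)  # column names and types

# Smallest and largest value in each column
age_range <- range(Loblolly$age)        # limits behind the x-axis
height_range <- range(Loblolly$height)  # limits behind the y-axis

age_range
height_range
```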
Once you have added data to the plot in this fashion, you need to tell ggplot how you want to draw that data. This is explored in the sections to come.
Making histograms with geom_histogram()
A histogram is a good way to display the variability in a single variable.
The following code produces a histogram of the height column in the Loblolly data set.
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give an x-axis for histograms
) +
  geom_histogram() # This draws the histogram
You can modify the histogram by changing several options in geom_histogram(). For example, we can set the option binwidth to change the width of the bars:
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    binwidth = 5   # the width of each bin will be 5 (in units of height)
  )
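As an alternative to binwidth, geom_histogram() also accepts a bins option that fixes the number of bars instead of their width. The sketch below is an illustrative addition; the choice of 10 bins and the object names p_bins and bars are arbitrary. It also uses layer_data() to look at the values ggplot2 computed for each bar.

```r
library(ggplot2)

p_bins <- ggplot(data = Loblolly, aes(x = height)) +
  geom_histogram(bins = 10) # ask for 10 bins rather than a bin width

p_bins

# layer_data() returns what ggplot2 computed for a layer:
# one row per bar, with its left edge, right edge, and count
bars <- layer_data(p_bins, 1)
bars[, c("xmin", "xmax", "count")]
```

Use either binwidth or bins, not both; if both are supplied, binwidth takes precedence.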
We can also change the options colour and fill to change the outline and in-fill colour of the bars, respectively:
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    colour = "black", # the outline colour on the bars
    fill = "grey"     # the in-fill colour in the bars
  )
The default behaviour of geom_histogram() is to plot the count on the y-axis, but in many statistical applications what we actually want is a density. With count, the height of each bar equals the number of observations that fall into that bin. With density, the bars are rescaled so that the area of each bar equals the proportion of the data set that falls into that bin, which means the areas of all the bars sum to 1.
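This can be achieved by setting the y-axis to be ..density.. (yes, including the dots) inside the geom_histogram() call. Note that ggplot2 version 3.4.0 and later deprecates this dot notation and issues a warning suggesting after_stat(density) instead, which produces the same plot.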
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    aes(y = ..density..),
    colour = "black", # the outline colour on the bars
    fill = "grey"     # the in-fill colour in the bars
  )
What you should notice here is that everything looks the same except for the scale on the y-axis.
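The claim that a density histogram represents proportions by area can be checked numerically. The sketch below is an illustrative addition: it uses after_stat(density), the newer spelling of the same quantity available from ggplot2 3.3.0, and an arbitrary bin width of 5. The object names are made up for this example.

```r
library(ggplot2)

p_density <- ggplot(data = Loblolly, aes(x = height)) +
  geom_histogram(
    aes(y = after_stat(density)), # density instead of count on the y-axis
    binwidth = 5,
    colour = "black",
    fill = "grey"
  )

p_density

# Area of each bar = density (bar height) x bin width
bars <- layer_data(p_density, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area # should equal 1, up to floating-point rounding
```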
Making Boxplots with geom_boxplot()
Like a histogram, a boxplot is a good way to show the variability in a single variable.
The following code produces a boxplot of the height variable in the Loblolly data set.
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # A single boxplot needs only one axis (x gives a horizontal boxplot)
) +
  geom_boxplot()   # this creates the boxplot
As with the histogram, we can change the outline and in-fill colour of the boxplot.
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # A single boxplot needs only one axis
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"   # the in-fill colour in the boxplot
  )
We can also remove the redundant information (labels, ticks, etc.) on the y-axis as follows. (Note that this requires a fairly advanced level of plot manipulation that will be covered in more detail later in this tutorial.)
ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # A single boxplot needs only one axis
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"   # the in-fill colour in the boxplot
  ) +
  theme(
    axis.text.y = element_blank(), # removes y-axis text
    axis.ticks.y = element_blank() # removes y-axis tick marks
  )
Boxplots can be vertical or horizontal, and you dictate this by choosing the axis. All the previous boxplots have been horizontal because we mapped our variable to the x axis. To make a vertical boxplot, we simply map it to the y axis instead (and make the corresponding changes in the theme code).
ggplot(
  data = Loblolly, # Provide a data set
  aes(y = height)  # Mapping the variable to y gives a vertical boxplot
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"   # the in-fill colour in the boxplot
  ) +
  theme(
    axis.text.x = element_blank(), # removes x-axis text
    axis.ticks.x = element_blank() # removes x-axis tick marks
  )
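It can help to connect the box in a boxplot to the numbers it summarises: the box runs from the first quartile to the third quartile, and the line inside it marks the median. The sketch below is an illustrative addition that compares quartiles computed with base R's quantile() against the values geom_boxplot() computes for a vertical boxplot. The object names q, p_box, and box are made up for this example.

```r
library(ggplot2)

# Quartiles of height computed directly
q <- quantile(Loblolly$height, probs = c(0.25, 0.5, 0.75))
q

# The values geom_boxplot() computes for a vertical boxplot
p_box <- ggplot(data = Loblolly, aes(y = height)) +
  geom_boxplot()

box <- layer_data(p_box, 1)
box[, c("lower", "middle", "upper")] # box bottom, median line, box top
```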
Making Scatterplots with geom_point()
A scatter plot shows the relationship between two variables.
The code below demonstrates how to make a scatter plot of the relationship between height and age in the Loblolly data set.
ggplot(
  data = Loblolly,
  aes(
    x = age,    # independent variable on the x-axis
    y = height  # dependent variable on the y-axis
  )
) +
  geom_point()  # this draws a scatter plot
We can introduce a third variable into a scatter plot by mapping a variable in the data set onto the colour aesthetic inside geom_point(). Here, we set the colour of the points based on the Seed variable.
ggplot(
  data = Loblolly,
  aes(
    x = age,    # independent variable on the x-axis
    y = height  # dependent variable on the y-axis
  )
) +
  geom_point(
    aes(colour = Seed) # this colours the points by the Seed variable
  )
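A common point of confusion is the difference between mapping colour to a variable (inside aes()) and setting colour to a fixed value (outside aes()). The sketch below is an illustrative addition that builds both versions and counts how many distinct colours each one draws. The object names and the choice of "blue" are arbitrary.

```r
library(ggplot2)

base_plot <- ggplot(data = Loblolly, aes(x = age, y = height))

# Mapped: colour is inside aes(), so it varies with the Seed column
p_mapped <- base_plot + geom_point(aes(colour = Seed))

# Set: colour is outside aes(), so every point gets the same fixed colour
p_set <- base_plot + geom_point(colour = "blue")

# Count the distinct colours actually drawn in each plot
n_mapped <- length(unique(layer_data(p_mapped, 1)$colour))
n_set <- length(unique(layer_data(p_set, 1)$colour))
c(mapped = n_mapped, set = n_set)
```

The mapped version uses one colour per level of Seed and adds a legend, while the set version uses a single colour and has no legend.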
Plotting with Small Multiples
The facet_wrap() function can be used to split up a plot into small multiples using a categorical variable.
Below, we demonstrate this by splitting up a scatter plot from the previous section into small windows based on the Seed variable from the Loblolly data set.
ggplot(
  data = Loblolly,
  aes(
    x = age,    # independent variable on the x-axis
    y = height  # dependent variable on the y-axis
  )
) +
  geom_point() +      # this draws a scatter plot
  facet_wrap(~ Seed)  # this creates multiple windows by Seed
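The facet_grid() function allows you to split by two variables at once and form a grid of plots. The example below uses the mtcars data set, with gear defining the rows of the grid and am defining the columns.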
ggplot(
  data = mtcars,
  aes(
    x = wt,
    y = mpg
  )
) +
  geom_point() +
  facet_grid(gear ~ am) # rows by gear, columns by am
As you can see here, not all of the small multiples contain data points. This is simply because there is no data in those groups, so there is nothing to plot. Typically, it is best to use facet_grid() in print only if every group actually contains data.
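Before using facet_grid(), you can check whether every combination of the two variables contains data. The sketch below is an illustrative addition that uses base R's table() to count the cars in each gear and am combination; the object names are made up for this example.

```r
# Number of cars in each gear x am combination
group_counts <- with(mtcars, table(gear, am))
group_counts

# Combinations with a count of zero correspond to the empty panels
empty_groups <- sum(group_counts == 0)
empty_groups
```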
Plotting a Line of Best Fit
The function geom_smooth() can be used to add lines of best fit to your scatter plots.
The default behaviour of geom_smooth() is to add a smooth curve to the scatter plot (for a data set of this size, a local regression, or loess, fit):
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  geom_smooth()
By setting the option method = lm, we can force this line to be straight:
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  geom_smooth(method = lm) # gives a straight line of best fit
Notice the grey envelope around the blue line of best fit. This envelope displays the 95% confidence interval around the line of best fit. Sometimes, you may want to remove the envelope for aesthetic reasons. In this case, you can set the option se = FALSE:
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  geom_smooth(
    method = lm,
    se = FALSE # removes the grey envelope
  )
If you have a plot where your data set is split up into small windows (see the section on facet_wrap()), you can use geom_smooth() to give you a different line of best fit for each window.
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  facet_wrap(~ Seed) + # this creates multiple windows by Seed
  geom_smooth()        # a separate smooth curve for each window
This also works when you set method = lm as an option inside geom_smooth(). In this case, the line of best fit in each window is allowed to have its own slope and intercept parameters.
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  facet_wrap(~ Seed) +     # this creates multiple windows by Seed
  geom_smooth(method = lm) # each line has its own slope and intercept
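To see the separate slopes and intercepts as numbers rather than lines, you can fit the same straight-line model to each seed yourself. The sketch below is an illustrative addition using only base R; it assumes the per-window fit is the simple linear model height ~ age, and the names lob, fits, and coefs are made up for this example.

```r
# Drop the extra grouping classes so Loblolly behaves as a plain data frame
lob <- as.data.frame(Loblolly)

# Fit height ~ age separately for each seed
fits <- lapply(
  split(lob, lob$Seed),
  function(d) coef(lm(height ~ age, data = d))
)

# One row per seed: the intercept and slope of that window's line
coefs <- do.call(rbind, fits)
colnames(coefs) <- c("intercept", "slope")
head(coefs)
```

Each row of coefs describes the straight line drawn in the window for that seed.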
Plotting Functions with stat_function()
In some cases we might be interested in plotting a particular function over some previously plotted data. For this, we use a function called stat_function(). A common use is to compare the data in a histogram to a normal distribution with a particular mean and standard deviation. stat_function() assumes that the function's first argument is the x-axis values; any other arguments must be supplied as a list() through the args option.
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(
    aes(y = ..density..), # to compare to dnorm we need density
    binwidth = 3,
    color = "black",
    fill = "grey60"
  ) +
  stat_function(
    fun = dnorm,   # the name of the function to be plotted
    args = list(   # any extra arguments to the function must be specified as a list
      mean = 20,
      sd = 6
    ),
    color = "red"  # colours help the function stand out from the data
  )
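Rather than typing a mean and standard deviation by hand, you can compute them from the data and pass them to stat_function(). The sketch below is an illustrative addition; it uses after_stat(density), the newer spelling of ..density.., and the names mpg_mean, mpg_sd, and p_normal are made up for this example.

```r
library(ggplot2)

# Sample mean and standard deviation of mpg
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

p_normal <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(
    aes(y = after_stat(density)), # density, so the histogram is comparable to dnorm
    binwidth = 3,
    colour = "black",
    fill = "grey60"
  ) +
  stat_function(
    fun = dnorm,
    args = list(mean = mpg_mean, sd = mpg_sd), # parameters estimated from the data
    colour = "red"
  )

p_normal
```

This way the curve follows the data automatically if the data set changes.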
Updating Plot Features
Axis Labels and Titles
It is important to put an informative title and axis labels on your plots.
Below, adding axis labels and a title is demonstrated using the basic scatter plot from the section on scatter plots.
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  labs( # this function adds labels
    x = "Age (years)",             # the x-axis label (note the quotation marks)
    y = "Height (ft)",             # the y-axis label (Loblolly heights are recorded in feet)
    title = "Loblolly tree growth" # overall title of the plot (note the quotation marks)
  )
Axis Transformations
Sometimes, it is necessary to transform variables in a data set to produce linear relationships. For example, consider the relationship between life expectancy and GDP per capita in the gapminder data set.
# let's take a look at the plot
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(x = "GDP per capita (USD)", y = "Life expectancy (years)")
This is clearly not a linear relationship. However, it can be made much closer to linear by applying the log10() function to the variable inside aes(). Be careful with axis labels, because directly log-transforming a variable changes the units on its axis!
ggplot(data = gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(
    x = expression(log[10]~GDP~per~capita~(log[10](USD))),
    y = "Life expectancy (years)"
  )
The most common transformations are log() (natural log, that is, base \(e\)), log10() (log base 10), and sqrt() (square root).
Transformations of variables (logarithm axis scales, scale_x_log10()/scale_y_log10())
It is possible to apply a log-transformation without directly transforming the variable. We can use a \(\log_{10}\) scale on the x-axis with scale_x_log10(). This only changes the spacing of the values along the x-axis. The tick labels remain in the original units, so the axis label does not need to change.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() + # puts the x-axis on a log10 scale
  labs(
    x = "GDP per capita (USD)",
    y = "Life expectancy (years)"
  )
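The two approaches place the points in the same positions; they differ only in how the axis is labelled. The sketch below is an illustrative addition that checks this by comparing the range of x positions ggplot2 uses under scale_x_log10() with the range of the hand-transformed variable. The object names are made up for this example.

```r
library(ggplot2)
library(gapminder)

p_log <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10()

# Internally, the points are positioned at their log10 values...
drawn_range <- range(layer_data(p_log, 1)$x)

# ...which matches transforming the variable by hand
manual_range <- range(log10(gapminder$gdpPercap))

drawn_range
manual_range
```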
Themes: theme_bw() and others
Themes are sets of aesthetic elements (e.g., background colour, grid outlines, etc.) that set the tone of a plot. There are several pre-prepared themes in ggplot2, including
theme_bw() - a black and white theme
theme_classic() - an old-timey theme
theme_linedraw() - a theme that draws very strong gridlines in the background
theme_dark() - a black and dark-grey theme
The use of theme_bw() is recommended for this unit.
ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  labs(
    x = "Age (years)",
    y = "Height (ft)", # Loblolly heights are recorded in feet
    title = "Loblolly tree growth"
  ) +
  theme_bw() # adds the black-and-white theme
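If you want every plot in a script to use the same theme, adding theme_bw() to each plot becomes repetitive. One option, shown in the sketch below as an illustrative addition, is theme_set(), which changes the default theme for all later plots in the session. The names old_theme and p are made up for this example.

```r
library(ggplot2)

# Make theme_bw() the default; theme_set() returns the previous default
old_theme <- theme_set(theme_bw())

# This plot now uses theme_bw() without adding it explicitly
p <- ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  labs(
    x = "Age (years)",
    y = "Height (ft)",
    title = "Loblolly tree growth"
  )

p

# Restore the previous default theme when finished
theme_set(old_theme)
```

Setting the theme once at the top of a script keeps the plotting code shorter and the plots consistent.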
Test Your Knowledge
You can use the R window below to help find the answers to some quiz questions.