Tutorial 4 - Visualisation

Introduction

Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.

This tutorial will primarily teach you how to use the ggplot2 package to create meaningful visualisations of data. It also uses some functions from dplyr (from the previous tutorial) and some data from the gapminder package. ggplot2 and dplyr are both part of the larger tidyverse collection of packages, so to simplify the code, we can load them in a single line.

Remember: you only need to install a package if trying to load it returns an error stating that the package does not exist.

#install.packages("tidyverse")
library(tidyverse)
#install.packages("gapminder")
library(gapminder)

The `ggplot2` package

The most basic use of the ggplot2 package is

ggplot()

Running the command above will generate an empty plot.

If you have data, you can add it to the plot at this stage. For the following, the Loblolly data set will be used to demonstrate ideas in ggplot2.

ggplot(
  data = Loblolly, # this MUST be a data frame
  aes(
    x = age, # what column from your data do you want on the x-axis?
    y = height # what column from your data do you want on the y-axis?
  )
)

There is nothing on this plot. However, notice that the x-axis and y-axis have been scaled to the limits of the age and height columns from the Loblolly data set.
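
As a quick check, the axis limits can be compared with the data directly. The sketch below uses only base R functions on the built-in Loblolly data set; the object names are chosen for illustration.

```r
# Loblolly is part of the datasets package that ships with R
is_df <- is.data.frame(Loblolly)       # TRUE: ggplot() can use it as `data`
age_range <- range(Loblolly$age)       # smallest and largest age
height_range <- range(Loblolly$height) # smallest and largest height

is_df
age_range
height_range
```

By default, ggplot2 pads each continuous axis slightly beyond these values, so the plotted limits sit just outside the ranges printed here.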

Once you have added data to the plot in this fashion, you need to tell ggplot how you want to draw that data. This is explored in the sections to come.

Making histograms with `geom_histogram()`

A histogram is a good way to display the variability in a single variable.

The following code produces a histogram of the height column in the Loblolly data set.

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give an x-axis for histograms
) +
  geom_histogram() # This draws the histogram

You can modify the histogram by changing several options in geom_histogram().

For example, we can set the option binwidth to change the size of the bars:

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    binwidth = 5 # the width of each bin will be 5 
  )

We can also change the options colour and fill to change the outline and in-fill colour of the bars, respectively:

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    colour = "black", # the outline colour on the bars
    fill = "grey"     # the in-fill colour in the bars
  )

The default behaviour of geom_histogram() is to plot the count on the y-axis, but in many statistical applications what we actually want is a density. With counts, the height of each bar equals the number of observations that fall into that bin. With density, the area of each bar represents the proportion of the total data set that falls into that bin.

This can be achieved by setting the y-axis to be ..density.. (yes, including the dots) inside the geom_histogram() call. In ggplot2 3.4.0 and later, this dot-dot notation is deprecated and produces a warning; the equivalent current form is after_stat(density).


ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give histograms an x-axis
) +
  geom_histogram(
    aes(y = ..density..),
    colour = "black", # the outline colour on the bars
    fill = "grey"     # the in-fill colour in the bars
  )

What you should notice here is that everything looks the same except for the scale on the y-axis.
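
The relationship between the two scales can be verified with a little arithmetic. The sketch below uses base R's hist() with plot = FALSE (rather than ggplot2) purely to get at the bin counts. The breaks it picks will generally differ from those chosen by geom_histogram(), but the conversion from count to density is the same.

```r
h <- hist(Loblolly$height, plot = FALSE) # compute bins without drawing

n <- sum(h$counts)          # total number of observations
bin_width <- diff(h$breaks) # width of each bin

# density = count / (n * bin width), so with equal bin widths the
# densities are just the counts divided by a constant
manual_density <- h$counts / (n * bin_width)

all.equal(manual_density, h$density) # TRUE
sum(manual_density * bin_width)      # the bar areas add up to 1
```

Because every count is divided by the same constant when the bins have equal width, the shape of the histogram is unchanged and only the y-axis scale differs.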

Making Boxplots with `geom_boxplot()`

Similar to a histogram, a boxplot is also a fine way to show the variability in a single variable.

The following code produces a boxplot of the height variable in the Loblolly data set.

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give boxplots a single axis
) +
  geom_boxplot() # this creates the boxplot
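
The box in a boxplot spans the first to the third quartile of the data, with a line at the median. As a rough cross-check, these values can be computed directly with base R's quantile(); the variable names below are illustrative.

```r
# Quartiles of height: the edges of the box and the median line
q <- quantile(Loblolly$height, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the length of the box
box_length <- unname(q[3] - q[1])
box_length

# Whiskers reach at most 1.5 box lengths beyond the box edges;
# points further out would be drawn individually as outliers
upper_limit <- unname(q[3]) + 1.5 * box_length
lower_limit <- unname(q[1]) - 1.5 * box_length
c(lower_limit, upper_limit)
```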

As with the histogram, we can change the outline and in-fill colour of the boxplot.

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give boxplots a single axis
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"     # the in-fill colour in the boxplot
  )

We can also remove the redundant information (labels, ticks, etc.) on the y-axis as follows. (Note that this requires a fairly advanced level of plot manipulation that will be covered in more detail later in this tutorial.)

ggplot(
  data = Loblolly, # Provide a data set
  aes(x = height)  # You only need to give boxplots a single axis
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"     # the in-fill colour in the boxplot
  ) +
  theme(
    axis.text.y = element_blank(), # removes y axis text
    axis.ticks.y = element_blank() # removes y axis breaks
  )

Boxplots can be vertical or horizontal, and you dictate this by specifying the axis. For example, all the previous boxplots have been horizontal because we specified our variable to run along the x-axis. To make a vertical boxplot, we simply specify the y-axis instead (and then make the corresponding changes in the theme code).


ggplot(
  data = Loblolly, # Provide a data set
  aes(y = height)  # Use the y-axis for a vertical boxplot
) +
  geom_boxplot(
    colour = "black", # the outline colour on the boxplot
    fill = "violet"     # the in-fill colour in the boxplot
  ) +
  theme(
    axis.text.x = element_blank(), # removes x axis text
    axis.ticks.x = element_blank() # removes x axis ticks
  )

Making Scatterplots with `geom_point()`

A scatter plot shows the relationship between two variables.

The code below demonstrates how to make a scatter plot of the relationship between height and age in the Loblolly data set.

ggplot(
  data = Loblolly, 
  aes(
    x = age, # independent variable on the x-axis
    y = height # dependent variable on the y-axis
  ) 
) +
  geom_point() # this draws a scatter plot

We can introduce a third variable into a scatter plot by mapping a variable in the data set onto the colour aesthetic, using aes() inside geom_point(). Here, we set the colour of the points based on the Seed variable.

ggplot(
  data = Loblolly, 
  aes(
    x = age, # independent variable on the x-axis
    y = height # dependent variable on the y-axis
  ) 
) +
  geom_point(
    aes(colour = Seed) # this colors the points by the Seed variable
  )
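
Because Seed is a factor, ggplot2 assigns one discrete colour per level and lists every level in the legend. Checking the number of levels beforehand gives a sense of how crowded that legend will be. This sketch uses base R only, and n_seeds is an illustrative name.

```r
is.factor(Loblolly$Seed)          # TRUE: Seed is categorical
n_seeds <- nlevels(Loblolly$Seed) # number of colours in the legend
n_seeds

# Number of measurements recorded for each seed source
table(Loblolly$Seed)
```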

Plotting with Small Multiples

The facet_wrap() function can be used to split up a plot into small multiples using a categorical variable.

Below, we demonstrate this by splitting up a scatter plot from the previous section into small windows based on the Seed variable from the Loblolly data set.

ggplot(
  data = Loblolly, 
  aes(
    x = age, # independent variable on the x-axis
    y = height # dependent variable on the y-axis
  ) 
) +
  geom_point() + # this draws a scatter plot
  facet_wrap(~ Seed) # this creates multiple windows by Seed

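The facet_grid() function allows you to split by two variables at once and form a grid of plots. The example below uses the built-in mtcars data set, with the rows of the grid defined by gear and the columns by am.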


ggplot(
  data = mtcars, 
  aes(
    x = wt, 
    y = mpg
  )
) + 
  geom_point() + 
  facet_grid(gear~am)

As you can see here, not all of the small multiples contain data points. This is simply because there are no observations in those groups, so there is nothing to plot. Typically, it is best to use facet_grid() in print only if there is actually data in each of the groups.
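
A quick cross-tabulation shows in advance which panels will be empty. The sketch below uses base R's table() on the two faceting variables from the plot above; panel_counts is an illustrative name.

```r
# Rows are gear values, columns are am values (0 = automatic, 1 = manual)
panel_counts <- table(gear = mtcars$gear, am = mtcars$am)
panel_counts

# Number of gear/am combinations with no cars, i.e. empty panels
sum(panel_counts == 0)
```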

Plotting a line of Best Fit

The function geom_smooth() can be used to add lines of best fit to your scatter plots.

The default behaviour of geom_smooth() is to add a smoothed curve to the scatter plot (for data sets with fewer than 1,000 observations, such as this one, a LOESS local regression):

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() + 
  geom_smooth()

By setting the option method = lm, we can force this line to be straight:

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() + 
  geom_smooth(method = lm) # gives a straight line of best fit

Notice the grey envelope around the blue line of best fit. This envelope displays the 95% confidence interval around the line of best fit. Sometimes, you may want to remove the envelope for aesthetic reasons. In this case, you can set the option se = FALSE:

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() + 
  geom_smooth(
    method = lm,
    se = FALSE # removes the grey envelope
  )
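
With method = lm, the line drawn by geom_smooth() is an ordinary least-squares fit of y on x. To read off its intercept and slope, the same model can be fitted directly with lm(). This is a minimal sketch, and the object name fit is illustrative.

```r
# Fit the straight line height = intercept + slope * age
fit <- lm(height ~ age, data = Loblolly)

coef(fit) # intercept and slope of the line of best fit

# 95% confidence intervals for the two coefficients
confint(fit, level = 0.95)
```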

If you have a plot where your data set is split up into small windows (see the section on facet_wrap()), you can use geom_smooth() to give you a different line of best fit for each window.

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() + 
  facet_wrap(~ Seed) + # this creates multiple windows by Seed
  geom_smooth() # individual lines for each window

This also works when you set method = lm as an option inside geom_smooth(). In this case, the line of best fit in each window is allowed to have its own slope and intercept parameters.

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() + 
  facet_wrap(~ Seed) + # this creates multiple windows by Seed
  geom_smooth(method = lm) # each line has its own slope and intercept
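
To see how much the fitted lines differ between windows, the same straight-line model can be fitted to each seed source separately. This is a base R sketch; fit_one is a hypothetical helper defined here, not a ggplot2 function.

```r
# Drop the extra grouping classes so Loblolly behaves as a plain data frame
trees <- as.data.frame(Loblolly)

# Hypothetical helper: intercept and slope for one seed source
fit_one <- function(d) coef(lm(height ~ age, data = d))

# One column per seed source; the rows are the intercept and the slope
per_seed <- sapply(split(trees, trees$Seed), fit_one)
round(per_seed, 2)
```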

Plotting Functions with `stat_function()`

In some cases we might be interested in plotting a particular function over some previously plotted data. For this, we use a function called stat_function(). A common use is to compare the data in a histogram with a normal distribution that has a particular mean and standard deviation. stat_function() assumes that the function's first argument is the x-axis values; any other arguments must be supplied as a list() through args.

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(
    aes(y = ..density..), # to compare with dnorm we need a density
    binwidth = 3,
    colour = "black",
    fill = "grey60"
  ) +
  stat_function(
    fun = dnorm,   # the function to be plotted
    args = list(   # extra arguments to the function must be given as a list
      mean = 20,
      sd = 6
    ),
    colour = "red" # colour helps the function stand out from the data
  )
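
The values mean = 20 and sd = 6 used above are close to the sample mean and standard deviation of mpg, which is presumably where they come from (an assumption; the tutorial does not say). A sketch of computing them from the data instead of typing them in, with illustrative object names:

```r
mpg_mean <- mean(mtcars$mpg) # sample mean of miles per gallon
mpg_sd <- sd(mtcars$mpg)     # sample standard deviation

round(c(mean = mpg_mean, sd = mpg_sd), 2)

# dnorm() evaluated at a few mpg values gives the height of the curve there
dnorm(c(10, 20, 30), mean = mpg_mean, sd = mpg_sd)
```

These objects could then be passed on as args = list(mean = mpg_mean, sd = mpg_sd).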

Updating Plot Features

Axis Labels and Titles

It is important to put an informative title and axis labels on your plots.

Below, the addition of axis labels and a title is demonstrated using the basic scatter plot from the section on scatter plots.

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  labs( # this function adds labels
    x = "Age (years)", # the x-axis label (note the quotation marks)
    y = "Height (ft)", # the y-axis label (note the quotation marks)
    title = "Loblolly tree growth" # overall title of the plot (note the quotation marks)
  )

Axis Transformations

Sometimes, it is necessary to transform variables in a data set to produce linear relationships. For example, consider the relationship between life expectancy and GDP per capita in the gapminder data set.

# let's take a look at the plot 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(x = "GDP per capita (USD)", y = "Life expectancy (years)")

This is clearly not a linear relationship. However, it can be made approximately linear by applying the log10() function to the variable inside aes(). Be careful with axis labels, because log-transforming a variable changes its units!

ggplot(data = gapminder, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.3) +
  labs(
    x = expression(log[10]~GDP~per~capita~(log[10](USD))), 
    y = "Life expectancy (years)"
  )

The most common transformations are the log() (natural log, that is, base \(e\)), log10() (log base 10), and sqrt() (square root).
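
One rough way to judge whether a transformation has straightened a relationship is to compare the linear correlation before and after. The sketch below does this for the gapminder example. It assumes the gapminder package is installed, and the correlation coefficient is only an informal guide here.

```r
library(gapminder)

# Linear correlation with life expectancy, before and after the log
r_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)
r_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)

c(raw = r_raw, log10 = r_log)

# log10() turns multiplication by 10 into addition of 1
log10(c(100, 1000, 10000)) # 2 3 4
```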

Transformations of variables (logarithm axis scales, `scale_x_log10()`/`scale_y_log10()`)

It is possible to apply a log transformation without directly transforming the variable. We can use a \(\log_{10}\) scale on the x-axis with scale_x_log10(). This only changes the spacing of the values along the x-axis; the tick labels stay in the original units, so the axis label does not need to change.

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() + # condenses the x axis scale to be logarithmic
  labs(
    x = "GDP per capita (USD)", 
    y = "Life expectancy (years)"
  )

Themes: `theme_bw()` and others

Themes are sets of aesthetic elements (e.g., background colour, grid outlines, etc.) that set the tone of a plot. There are several pre-prepared themes in ggplot2, including

theme_bw() - a black and white theme
theme_classic() - a classic-looking theme with axis lines and no gridlines
theme_linedraw() - a theme that draws very strong gridlines in the background
theme_dark() - a black and dark-grey theme

The use of theme_bw() is recommended for this unit.

ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point() +
  labs( 
    x = "Age (years)", 
    y = "Height (ft)", 
    title = "Loblolly tree growth"
  ) +
  theme_bw() # adds the black-and-white theme
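
Adding theme_bw() to every plot becomes repetitive. ggplot2 also provides theme_set(), which changes the default theme for all plots drawn afterwards in the session. A minimal sketch, with illustrative object names:

```r
library(ggplot2)

# Make theme_bw() the default; the previous default is returned invisibly
old_theme <- theme_set(theme_bw())

# This plot now uses theme_bw() without asking for it explicitly
p <- ggplot(data = Loblolly, aes(x = age, y = height)) +
  geom_point()
p

# Restore the earlier default when finished
theme_set(old_theme)
```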
