Tutorial 3 - Data Manipulation in R
Introduction
Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If you are following along in your own version of RStudio, now is the time to open up a new script (and/or project) file, and save it in a logical place.
Data in R
A data frame is a way of storing a table of information, and much of
what we do in R will involve data frames. In this tutorial you will
learn how to extract information out of data frames, as well as import
data frames from elsewhere on your computer, build new data frames, use
built-in data frames, and manipulate data frames in a number of
different ways using the dplyr package. This might include
calculating new variables, filtering by some condition, and other
general data tidying. We will also talk a little about some common errors
people make when learning how to deal with data frames. Finally, we will
introduce a few functions from the forcats package that may
help you to wrangle any categorical variable you may come across.
Setting Up
We should start our session by loading any packages we plan on using.
Today, we are learning about the dplyr and
forcats packages, and also looking at a few functions from
the stringr package, so make sure that these are installed
and then load them using the following code.
#install.packages("dplyr")
library(dplyr)
#install.packages("forcats")
library(forcats)
#install.packages("stringr")
library(stringr)
Creating data frames using data.frame()
To create a data frame we use the function data.frame(). For
example, I would like to store my 5 favorite customers’ birthdays so that
I can be sure to remember them.
# Let's first store my customers' names in a vector to make things easier for me
Customers <- c('Rachel', 'Rita', 'Dorothy', 'Chien-Shiung', 'Gertrude')
# Then I want to create my data frame
Birthdays <- # This line will save in R whatever follows as Birthdays
data.frame( #this opens the function
Names = Customers, # Creates a column (Names) from vector called Customers
Months = c('May', 'April', 'May', 'May', 'January'), # Creates a column (Months) with new data
Days = c(27, 22, 12, 31, 23) # Creates a column Days with new data
)
Birthdays #This will make R show the dataframe we have just saved as Birthdays
Individual columns of a data frame can then be referenced by using the name of the data frame and the column name and separating them with a ‘$’.
Birthdays$Names #This will display the Names column
Birthdays$Days #This will display the Days column
# Try and access the Months column in the next line...
This referencing can also be used to add a column to a saved data frame. For example, if we wished to add the amount each of our customers has spent, we can do this by declaring that a new column of Birthdays called Spent is equal to a vector of those amounts.
Birthdays$Spent <- #Everything after this will fill the column Spent
c(473, 826.92, 932, 1040, 273.36)
Birthdays
Specific elements of a data frame can be referenced using this method
and adding the row number of the element inside of the square brackets
[ ].
Birthdays$Names[3] #Third row of Names column
Birthdays$Spent[5] #Fifth row of Spent column
This referencing can also be used to alter our data frame. For example, if Dorothy wanted to be called by her nickname Dottie, we can change this in our data frame, as we know from above where her name is stored: in the data frame Birthdays, in the column Names, in the third row.
Birthdays$Names[3] <- "Dottie"
Birthdays$Names[3] #Third row of Names column
Specific elements of a data frame -those referenced by their data frame name, column name and row placement- that are numeric can be used in math operations in R. While Dottie was in the store she also purchased something for $20.89. We know that the amount each customer spends belongs in the column ‘Spent’ in our data frame, for Dottie this amount will be in the third row.
#This will work out her current amount
Birthdays$Spent[3] + 20.89
Birthdays$Spent[3] <- #this will overwrite the previously saved value
Birthdays$Spent[3] + 20.89
Birthdays
Exercise
Take the Birthdays data frame and try adding a column using the
$ operator that stores the following as the customers’ coffee
orders. Title the column Orders.
- Rachel likes to order a Regular Flat White
- Rita’s favourite is a Small Double Shot Long Black
- Dottie usually gets a Half-Strength Cappuccino in a Cup
- Chien-Shiung orders an Iced Latte on Soy Milk
- Gertrude simply likes a Strong Black Coffee, Extra Hot
Don’t forget to have a look at your resulting data frame and make sure it worked.
Viewing data frame specifics
Assuming the previous exercise was done correctly, your
Birthdays data set should now look something like this…
Birthdays
Using this data frame, Birthdays, we can explore some
functions that let us interact with the data we have stored. First, here
are some simple ones to keep in mind, especially for large amounts of
data.
These functions only require you to input the name of your stored
data frame, in our example Birthdays.
dim() will output two numbers: the first is the number
of rows, the second is the number of columns.
dim(Birthdays)
names() will output the names of each of the columns of
your data frame.
names(Birthdays)
Viewing small portions: head() and
tail()
The following two functions will be useful for larger data sets with
more than 6 observations. head() outputs the first 6
rows of a data frame, while tail() outputs the final 6
rows.
head(Birthdays)
tail(Birthdays)
You could also specify how many entries you would like to include by including this as an extra argument. For example,
head(Birthdays, 3)
tail(Birthdays, 3)
Remember, a dollar sign $ placed between the name of a
data frame and the name of one of its columns tells R to reference what is
in that column within that data frame.
unique() displays each unique element of a data frame’s
column.
unique(Birthdays$Months)
unique(c(1,1,1,2,2,3,3,3))
Many different data types (classes) exist within R. We can
check the class of an object using the class() function.
# we can determine the class of complex data objects
class(Birthdays)
# but we can also assess the class of vectors
class(Customers)
# or on individual columns of data
class(Birthdays$Days)
Some functions require the data inputs to be of a particular class, which is why it’s important for us to be able to assess the class of our data objects.
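As a small illustration of why class matters, the sketch below uses a made-up vector, ages_text, in which numbers have been stored as text. mean() cannot work with character data, so we convert the vector with as.numeric() first.

```r
# hypothetical example: numbers accidentally stored as text
ages_text <- c("21", "35", "40")
class(ages_text)              # "character"

# mean(ages_text) would return NA with a warning, because the input is not numeric
# converting to the numeric class first fixes this
ages_num <- as.numeric(ages_text)
class(ages_num)               # "numeric"
mean(ages_num)                # 32
```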
table() displays the number of times each unique element
of a column occurs
table(Birthdays$Months)
R’s Built-in data frames
R also has a selection of built-in data frames for use in testing code and data analysis methods, and (luckily for us) for demonstrating these in R.
In this section, we simply show a sample of a number of these built-in
data sets, which you may see throughout this tutorial series. For
added info on each of these datasets, you can also check out the help
file via help(data-set-name).
Edgar Anderson’s Iris Data (iris)
From the help file on this data: “This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.”
# what columns does it have?
names(iris)
# what are the dimensions?
dim(iris)
# what does the table structure look like?
head(iris)
# the class?
class(iris)
Motor Trend Car Road Tests (mtcars)
From the help file on this data: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”
# what columns does it have?
names(mtcars)
# what are the dimensions?
dim(mtcars)
# what does the table structure look like?
head(mtcars)
# the class?
class(mtcars)
Starwars Characters (starwars)
From the help file on this data: “The original data, from SWAPI, the Star Wars API, https://swapi.dev/, has been revised to reflect additional research into gender and sex determinations of characters.”
# what columns does it have?
names(starwars)
# what are the dimensions?
dim(starwars)
# what does the table structure look like?
head(starwars)
# the class?
class(starwars)
Growth of Loblolly Pine Trees (Loblolly)
From the help file on the data: “The Loblolly data frame has 84 rows and 3 columns of records of the growth of Loblolly pine trees.”
# what columns does it have?
names(Loblolly)
# what are the dimensions?
dim(Loblolly)
# what does the table structure look like?
head(Loblolly)
# the class?
class(Loblolly)
Gapminder Data (gapminder)
Sometimes, we use datasets that exist within a package in R. One such
package is the gapminder package. To use the
gapminder dataset from this package, we must first load the
package, and then the dataset can be used like any other built-in
dataset. From the help file: “Excerpt of the Gapminder data on life
expectancy, GDP per capita, and population by country.”
library(gapminder)
# What columns does it have?
names(gapminder)
# What are the dimensions?
dim(gapminder)
# What does the table structure look like?
head(gapminder)
# What is the class of the data object?
class(gapminder)
Importing dataframes with read_csv()
For these tutorials, we will primarily be using built-in data frames,
due to the tutorials being hosted online (that is, not locally on your
machine). However, it is important to know that you can easily import
data in .csv format from your local machine into RStudio
should you need it.
This is done using the readr package and the
read_csv() function like so… Say you have a file titled
ecology.csv in your working directory, then…
library(readr)
ecology <- # need to give it a name so that it stores
read_csv("ecology.csv")
This will be necessary any time you wish to use external data inside your RStudio session, but don’t forget to set your working directory (see tutorial 1).
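If you do not have a .csv file to hand, the following sketch creates one so that you can try read_csv() yourself. The data, the object names and the temporary file are all made up for illustration; with your own data you would pass the path to your file instead.

```r
library(readr)

# a small made-up data frame to write out (illustration only)
ecology_demo <- data.frame(site = c("A", "B", "C"), count = c(12, 7, 30))

# write it to a temporary .csv file, then read it back in
demo_path <- tempfile(fileext = ".csv")
write_csv(ecology_demo, demo_path)
ecology_in <- read_csv(demo_path)

ecology_in
```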
Manipulating dataframes with dplyr
Introduction to dplyr
dplyr is a package that allows the user to easily and
readably manipulate data within R. It does this by having functions
which are verbs representing each of the main actions one can perform on
a data frame.
Pulling out specific columns using pull()
Previously, in this tutorial series, you learned how to access
individual columns of a data frame using the $ notation.
There is a dplyr function for this also, pull(). It returns the same values
as $, and for some columns (factors, for example) R prints extra information alongside the values.
# pull('data frame name', 'column name')
pull(mtcars, wt)
# pulling out the species gives some extra info on the levels
# more on this later
pull(iris, Species)
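A quick check, as a sketch, that pull() and the $ notation return the same values:

```r
library(dplyr)

# both lines extract the wt column of mtcars as a plain vector
wt_pull <- pull(mtcars, wt)
wt_dollar <- mtcars$wt

# identical() returns TRUE when two objects are exactly the same
identical(wt_pull, wt_dollar)
```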
Exercise
Use the mtcars data set and the dplyr
package to pull out each of the wt,
disp and cyl columns individually.
Choosing specific columns: select()
select() outputs one or more specified columns of a data
frame.
select(
# the first input should always be the data frame in question
mtcars,
# and then any columns you wish to output from this data frame
mpg)
# any number of columns can be selected this way
select(mtcars,
mpg, wt, cyl) # Shows the mpg, wt and cyl columns of the data frame
The output of this function can be saved as a new data frame in the
usual way, using ‘<-’ and a new name; let’s use
EfficiencyMeasures.
EfficiencyMeasures <- select(mtcars, wt, cyl)
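As a quick check (a sketch using only functions introduced above), we can confirm that the saved data frame contains just the selected columns and that mtcars itself has not been changed:

```r
library(dplyr)

EfficiencyMeasures <- select(mtcars, wt, cyl)

# the new data frame only has the two selected columns
names(EfficiencyMeasures)
# the original data frame still has all of its rows and columns
dim(mtcars)
```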
Exercise
Use the dplyr package and the iris data set
to select all columns of data relating to the
Species and Petal.
Ordering rows: arrange()
arrange() outputs the data frame with its rows sorted in either
alphabetical or numeric order based on the columns selected. One or more
columns can be selected, and the order in which they are given dictates
the order in which they are used for sorting.
# let's arrange by cylinder value
arrange(mtcars, cyl)
# we can use nested functions also (i.e. functions inside other functions)
# for example, say we want to order by the rownames in mtcars
arrange(mtcars, rownames(mtcars))
# we can even arrange by multiple variables
# let's see how this looks when applied to the head() of the iris data
arrange(head(iris), Sepal.Length, Sepal.Width)
If the desc() function is wrapped around a column name inside
arrange(), then R will sort by that column in reverse (descending) order.
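As a sketch of how desc() is used, the code below sorts mtcars from the most to the least fuel-efficient car (mtcars is used here so as not to give away the exercise, and the name mtcars_desc is our own choice):

```r
library(dplyr)

# sort the cars from the highest mpg to the lowest
mtcars_desc <- arrange(mtcars, desc(mpg))
head(mtcars_desc, 3)
```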
Exercise
Take a look at the starwars data set, and see if you can
arrange the data in reverse alphabetical order by name.
Show data satisfying certain criteria: filter()
filter() allows us to limit our data frame to the rows that satisfy a
condition, built by applying the logical operators from earlier in the
series to specific columns. For example, to filter the Birthdays data
frame to only show those born in May we can use the ‘==’ operator.
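As a sketch of this, the code below rebuilds a cut-down version of the birthdays data (called BirthdaysDemo here so that it does not overwrite your own Birthdays object) and keeps only the customers born in May.

```r
library(dplyr)

# a cut-down copy of the birthdays data from earlier in this tutorial
BirthdaysDemo <- data.frame(
  Names = c('Rachel', 'Rita', 'Dottie', 'Chien-Shiung', 'Gertrude'),
  Months = c('May', 'April', 'May', 'May', 'January')
)

# keep only the rows where the Months column is equal to "May"
MayBirthdays <- filter(BirthdaysDemo, Months == "May")
MayBirthdays
```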
Time to dive into the starwars data set. Let’s filter
out any star wars characters that don’t have brown hair…
filter(starwars, hair_color == "brown")
You may have noticed in the built-in data sets section, in the
head() of the starwars data, that Owen Lars’
hair colour is listed as "brown, grey" but this character
is not in our filtered list. Here we can get a little more tricky, and
use the stringr package to construct our condition.
We want to keep any data frame entries where the word brown is
detected in the hair_color column.
The function str_detect() returns a TRUE or
FALSE value when a “pattern” string is detected inside
another string, in this case, the strings making up the values inside
the hair_color column. Don’t forget to load the new
package.
filter(starwars, str_detect(hair_color, "brown"))
To limit this data frame in a different way we can use our boolean operations from earlier in the series. For example, suppose we only wanted to look at the star wars characters who are taller than average for the universe.
filter(starwars, height >= mean(height))
# weird, this came out empty
# what if we tell R to remove any NA values from the calculation
filter(starwars, height >= mean(height, na.rm = TRUE))
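Why was the first result empty? The height column contains missing values (NA), so mean(height) is itself NA, every comparison against NA is NA, and filter() only keeps rows where the condition is TRUE. A small sketch with a made-up vector shows the same behaviour:

```r
# a made-up vector of heights with one missing value
heights_demo <- c(150, 180, NA)

mean(heights_demo)                  # NA, because one value is missing
mean(heights_demo, na.rm = TRUE)    # 165, the mean of the non-missing values

# comparing against an NA mean gives NA for every element
heights_demo >= mean(heights_demo)
```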
But if we want to limit it to those who have brown hair AND are above average in height, we would use the ‘&’ operator.
filter(starwars, str_detect(hair_color, "brown") & height >= mean(height, na.rm = TRUE))
Changing or creating a column: mutate()
mutate() allows another column to be added to the data
frame. It is unique in that it allows this to be done using specific
column information.
Let’s add another column, called MonthCat, to the airquality data
frame. airquality is another of R’s built-in data frames, and its column
labelled Month gives the month as a numerical value. We
would like to create MonthCat, which will contain the
months as their abbreviated names, e.g. 5 = May. To do this we need to
borrow a function, month(), from another package called
lubridate (a package for working with dates).
This data set is quite large, so I’m going to store the result as a separate
object and then take a look at the head() to check that it
worked.
airqualityNew <- mutate(
airquality, # the dataframe we wish to modify
# the month() function takes an extra argument `label` which tells it to give us the names
MonthCat = lubridate::month(Month, label = TRUE) # the function month() takes input of the Month column
)
head(airqualityNew, 10)
We can also use mutate to do simple arithmetic manipulations and much more complex things on columns in the data frame.
#mutate('DataFrameName', 'NewColumnName' = height column divided by 100)
mutate(starwars, heightm = height/100)
# because of how it prints, we cannot see the values in our resultant dataframe
# let's try it again and select the columns that are most relevant for our check
mutate(starwars, heightm = height/100) %>%
select(name, height, heightm)
Maybe we don’t need two height columns in our data frame, and instead of creating a new column for our height in metres, we can simply overwrite the original column.
starwarsMetres <- mutate(starwars, height = height/100)
starwarsMetres %>% select(name, height)
# verify that we only have one height column in our dataframe
names(starwarsMetres)
Exercise mutate()
The starwars characters are getting a routine health check up. The doctor knows their height and weight but would like to have their BMI on record also. The formula for this is:
\[ BMI = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^2} \]
Your task is to add this value into the starwars data frame using the
mutate() function. Note that weight is stored in the mass column.
# recall what we named the data where height was converted to metres
Telling R to pay attention to certain groupings:
group_by()
group_by() changes the structure of a data frame so that
it groups observations together according to the values in a particular
column. This can be used in our example data frame
starwars to group the characters by the column
species. Placing the name of the original data frame with ‘<-’
before this function overwrites our original data frame. If the original
needs to be retained we can simply create a new name for this data
frame.
starwars.species <- group_by(starwars, species)
starwars.species
The group_by() function does not change how the data
frame appears to us, but it does change the results of
other functions on the data frame to take into account the grouping. One
example is with the summarise() function described
below.
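Although the rows look the same, the grouping is stored with the data frame. As a sketch, dplyr's group_vars() and n_groups() functions report what a data frame is grouped by, and ungroup() removes the grouping again. The mtcars data and the object name cars_by_cyl are used here purely for illustration.

```r
library(dplyr)

cars_by_cyl <- group_by(mtcars, cyl)

# which column(s) is the data frame grouped by, and how many groups are there?
group_vars(cars_by_cyl)
n_groups(cars_by_cyl)

# ungroup() returns the data frame to its ungrouped state
cars_ungrouped <- ungroup(cars_by_cyl)
group_vars(cars_ungrouped)
```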
Creating summaries: summarise()
summarise() is a function used most often in tandem with
group_by(), see above. It allows you to summarise data
using functions keeping to whatever grouping structure a data frame has.
Above we grouped our starwars data frame by
species so our summary of data will be for each unique
species.
# summarise('NameDataFrame', 'New Variable Name' = mean of height column)
summarise(starwars.species, Av_height = mean(height, na.rm = TRUE))
Compare this to the same command on the original data frame
without the group_by() applied:
# summarise('NameDataFrame', 'New Variable Name' = mean of height column)
summarise(starwars, Av_height = mean(height, na.rm = TRUE))
It is important to remember that each of these outputs can be saved in R by simply adding a unique name and <- in front of the function. If you wish to overwrite your original data frame simply place its name before <- and the function code.
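summarise() can also compute several summaries in one call. The sketch below uses mtcars (so as not to give away the exercise) to get the mean fuel efficiency and the number of cars for each number of cylinders. n() is a dplyr function that counts the rows in each group, and the column names mean_mpg and n_cars are our own choices.

```r
library(dplyr)

cars_by_cyl <- group_by(mtcars, cyl)

# one row per group, one column per summary
cyl_summary <- summarise(cars_by_cyl,
                         mean_mpg = mean(mpg),
                         n_cars = n())
cyl_summary
```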
Exercise
Consider the gapminder data set from the
gapminder package. Use your newfound knowledge of the
group_by() and summarise() combination to find
the average life expectancy per continent.
library(gapminder)
Counting entries per group with count()
The count() function does two steps at once, by grouping
a data frame and then tallying up the number of observations that fall
into each of the groups.
# count('NameDataFrame', 'GroupingVariable')
count(starwars, sex)
# your turn! Count how many characters there are with each eye colour
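To see the two steps that count() combines, here is a sketch comparing it with the group_by() and summarise() approach, using the cyl column of mtcars. n() is a dplyr function that counts the rows in each group.

```r
library(dplyr)

# one step
count_one_step <- count(mtcars, cyl)

# the same thing in two steps
count_two_steps <- summarise(group_by(mtcars, cyl), n = n())

count_one_step
count_two_steps
```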
Transforming wide data to long data:
pivot_longer()
Sometimes data comes in a format that is different to how we might like it. It helps to be able to switch between wide and long data frame formats. A wide data frame typically has a column for each variable, whereas a long data frame has one column containing the variable names and another column containing the corresponding values.
Let’s take a look at a couple of examples.
First we will create our own wide data frame…
# rnorm() is a function that draws random values from a standard normal distribution
wide_df <- data.frame(day = 1:5, X1 = rnorm(5), X2 = rnorm(5))
wide_df
Let’s transform this into a long data frame using
pivot_longer()
# pivot_longer() comes from the tidyr package, so load it first
library(tidyr)
# ID becomes the name of the column containing the variable names
# the name of the value column defaults to "value" but this can be set using the values_to= argument
tall_df <- pivot_longer(data = wide_df, cols = c("X1", "X2"), names_to = "ID")
tall_df
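Going the other way, from long back to wide, can be done with pivot_wider(), also from the tidyr package. The sketch below uses fixed numbers rather than rnorm() so that the result is easy to check; the object names are made up for illustration.

```r
library(tidyr)

# a small wide data frame with fixed values
wide_demo <- data.frame(day = 1:3, X1 = c(0.5, 1.2, -0.3), X2 = c(2.1, 0.0, 1.7))

# wide to long, naming the value column ourselves with values_to
long_demo <- pivot_longer(wide_demo, cols = c("X1", "X2"),
                          names_to = "ID", values_to = "measurement")
long_demo

# long back to wide: the ID values become the column names again
wide_again <- pivot_wider(long_demo, names_from = ID, values_from = measurement)
wide_again
```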
Wrangling categorical variables using the forcats
package
If we wish to treat a categorical variable in a data frame as an
ordinal variable, it is best to convert it to a factor
class variable. Factors are variables which have certain
levels which then exist in some kind of order. The
forcats package exists to help us wrangle variables of this
type.
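As a small sketch of what this means, the made-up vector below is converted to a factor. By default the levels are placed in alphabetical order, which is not the natural order here; the levels argument of factor() lets us state the order ourselves.

```r
# a made-up categorical variable
sizes <- c("small", "large", "medium", "small")

# default: levels are sorted alphabetically
sizes_fct <- factor(sizes)
levels(sizes_fct)       # "large" "medium" "small"

# specifying the order of the levels ourselves
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)   # "small" "medium" "large"
```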
Changing the baseline / order of categorical variables:
fct_relevel()
The fct_relevel() function can be used to re-specify the
order of a factor’s levels (note that the first level is usually considered the
baseline, and by default this is the level that comes first
alphabetically). In the mtcars dataset, the default baseline for the
am variable is the “0” category:
# convert the am variable to a factor using the factor() function
mtcarsFct <- mutate(mtcars, am = factor(am))
# take a look at how it now outputs that column
pull(mtcarsFct, am)
The code below replaces the am column with a re-levelled
version of itself (none of the data is changed, the only change is what
R considers the baseline category). Note that the baseline is now the
category “1”.
mtcarsFct <- mutate(mtcarsFct, am = fct_relevel(am, "1", "0"))
# compare this to the previous output
pull(mtcarsFct, am)
Changing the names of categorical variables:
fct_recode()
The fct_recode() function can be used to rename some or
all categories. The first argument is a data frame column that is a
factor, and then each change follows in the form "new name" = "old name", separated by commas.
mtcarsFct <- mutate(mtcarsFct,
am = fct_recode(am, "auto" = "0",
"manual" = "1"))
# check how it looks now
pull(mtcarsFct, am)
Recategorizing certain factors as “other”:
fct_other()
The fct_other() function can be used to put data with
certain category values into a single category. It is useful when one
has few observations for a large number of categories.
The following code re-categorizes any data that doesn’t have the species “setosa” as “other”:
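A minimal sketch of such a call, assuming the keep argument of fct_other() is what is intended here (any level not listed in keep is moved into the other category):

```r
library(forcats)

# keep "setosa" as it is and lump every other species into "other"
species_other <- fct_other(iris$Species, keep = "setosa", other_level = "other")
table(species_other)
```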
Alternatively, you can specify exactly which categories you wish to amalgamate into the “other” category:
fct_other(iris$Species, drop = c("virginica", "versicolor"), other_level = "other")
Other Useful Things
The pipe operator: %>%
The pipe operator %>% takes what is on its left hand
side, and uses it as the first argument for what is on its right hand
side. It is an invaluable tool to make one’s R code more readable (both
for the person writing it and others). The following example
demonstrates three ways to summarise the iris data set via
species group means:
# two lines, easy to follow, but a bit verbose
# (also requires creating the intermediate data frame `grouped_iris`)
grouped_iris <- group_by(iris, Species)
summarise(grouped_iris, meanWidth = mean(Petal.Width))
# all on one line, hard to follow
summarise(group_by(iris, Species), meanWidth = mean(Petal.Width))
# using pipes, much more elegant
iris %>%
group_by(Species) %>%
summarise(meanWidth = mean(Petal.Width))
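To make the first-argument rule concrete, here is a short sketch showing that a piped call and the ordinary call give the same result (the object names piped and ordinary are our own):

```r
library(dplyr)

# x %>% f(y) is the same as f(x, y)
piped <- mtcars %>% head(3)
ordinary <- head(mtcars, 3)

# identical() returns TRUE when two objects are exactly the same
identical(piped, ordinary)
```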
Multiple functions at once using across() within other
functions
An advanced feature is to use across() inside of functions
such as summarise() to avoid having to run multiple lines. The part
.names = "{.fn}_{.col}" tells R to name the columns starting
with the function, then an underscore, and then the name of the original
column.
grouped_iris <- group_by(iris, Species)
# gives grouped mean and standard deviation
summarise(grouped_iris, mean.Petal.Width = mean(Petal.Width),
sd.Petal.Width = sd(Petal.Width))
summarise(grouped_iris, mean.Petal.Length = mean(Petal.Length),
sd.Petal.Length = sd(Petal.Length))
# does the same as above, but puts them into a single table
summarise(grouped_iris, across(c(Petal.Width, Petal.Length),
list(mean = mean, sd = sd),
.names = "{.fn}_{.col}"))