
QUT School of Mathematical Sciences   Tutorial 3 - Data Manipulation in R

Introduction

Make sure you have access to or have installed R and RStudio prior to beginning this tutorial. There are details on how to do this correctly in tutorial 1. If following along on your own version of RStudio, this would be the time to open up a new script (and/or project) file, and save it in a logical place.

Data in R

A data frame is a way of storing a table of information, and much of what we do in R will involve data frames. In this tutorial you will learn how to extract information out of data frames, as well as import data frames from elsewhere on your computer, build new data frames, use built-in data frames, and manipulate data frames in a number of different ways using the dplyr package. This might include calculating new variables, filtering by some condition, and other general data tidying. We will also talk a little about some common errors people make when learning how to deal with data frames. Finally, we will introduce a few functions from the forcats package that may help you to wrangle any categorical variable you may come across.

Setting Up

We should start our session by loading any packages we plan on using. Today, we are learning about the dplyr and forcats packages, and also looking at a few functions from the stringr package, so make sure that these are installed and then load them using the following code.

#install.packages("dplyr")
library(dplyr)
#install.packages("forcats")
library(forcats)
#install.packages("stringr")
library(stringr)

Creating data frames using data.frame()

To create a data frame we use the function data.frame(). For example, I would like to store my 5 favorite customers’ birthdays so that I can be sure to remember them.

# Let's first store my customers' names in a vector to make things easier for me 
Customers <- c('Rachel', 'Rita', 'Dorothy', 'Chien-Shiung', 'Gertrude')

# Then I want to create my data frame 
Birthdays <- # This line will save in R whatever follows as Birthdays
data.frame( #this opens the function
  Names = Customers, # Creates a column (Names) from vector called Customers
  Months = c('May', 'April', 'May', 'May', 'January'), # Creates a column (Months) with new data 
  Days = c(27, 22, 12, 31, 23) # Creates a column Days with new data
)

Birthdays  #This will make R show the dataframe we have just saved as Birthdays 

Individual columns of a data frame can then be referenced by using the name of the data frame and the column name and separating them with a ‘$’.

Birthdays$Names #This will display the Names column

Birthdays$Days #This will display the Days column

# Try and access the Months column in the next line... 

This referencing can also be used to add a column to a saved data frame. For example, if we wished to add the amount each of our customers has spent to our data frame, we can do this by declaring that a new column of Birthdays called Spent is equal to a vector of those amounts.

Birthdays$Spent <- #Everything after this will fill the column Spent
   c(473, 826.92, 932, 1040, 273.36)

Birthdays

Specific elements of a data frame can be referenced using this method and adding the row number of the element inside of the square brackets [ ].

Birthdays$Names[3] #Third row of Names column 

Birthdays$Spent[5]  #Fifth row of Spent column 

This referencing can also be used to alter our data frame. For example, if Dorothy wanted to be called by her nickname Dottie, we can change this in our data frame because we know from above where her name is stored: in the data frame ‘Birthdays’, in the column Names, in the third row.

Birthdays$Names[3] <- "Dottie"

Birthdays$Names[3] #Third row of Names column 

Specific elements of a data frame (those referenced by their data frame name, column name and row position) that are numeric can be used in math operations in R. While Dottie was in the store she also purchased something for $20.89. We know that the amount each customer spends belongs in the column ‘Spent’ in our data frame; for Dottie this amount will be in the third row.

#This will work out her current amount 
Birthdays$Spent[3] + 20.89

Birthdays$Spent[3] <- #this will override the previous save
  Birthdays$Spent[3] + 20.89

Birthdays

Exercise

Take the Birthdays data frame and try adding a column using the $ operator that contains the following as the customers’ coffee orders. Title the column Orders.

  • Rachel likes to order a Regular Flat White
  • Rita’s favourite is a Small Double Shot Long Black
  • Dottie usually gets a Half-Strength Cappuccino in a Cup
  • Chien-Shiung orders an Iced Latte on Soy Milk
  • Gertrude simply likes a Strong Black Coffee, Extra Hot

Don’t forget to have a look at your resulting data frame and make sure it worked.

Viewing data frame specifics

Assuming the previous exercise was done correctly, your Birthdays data set should now look something like this…

Birthdays

Using this data frame, Birthdays, we can explore some functions that let us interact with the data we have stored. First, here are some simple ones to keep in mind, especially for large amounts of data.

These functions only require you to input the name of your stored data frame, in our example Birthdays.

dim() will output two numbers, the first for the number of rows the second for the number of columns.

dim(Birthdays)

names() will output the names of each of the columns of your data frame.

names(Birthdays)

Viewing small portions: head() and tail()

The following two functions are useful for larger data sets with more than 6 observations. head() outputs the first 6 rows of a data frame, while tail() outputs the final 6 rows.

head(Birthdays)

tail(Birthdays)

You could also specify how many entries you would like to include by including this as an extra argument. For example,

head(Birthdays, 3)

tail(Birthdays, 3)

Remember, a dollar sign $ placed between the name of a data frame and the name of one of its columns tells R to reference what is in that column within that data frame.

unique() displays each unique element of a data frame’s column.

unique(Birthdays$Months)
It also works on vectors:
unique(c(1,1,1,2,2,3,3,3))

R has many different data types. We can check which type an object is using the class() function.

# we can determine the class of complex data objects 
class(Birthdays)

# but we can also assess the class of vectors 
class(Customers)

# or on individual columns of data 
class(Birthdays$Days)

Some functions require the data inputs to be of a particular class, which is why it’s important for us to be able to assess the class of our data objects.
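As a small illustration of why class matters, here is a sketch using made-up prices that have been stored as text (the prices vector is hypothetical and not part of the tutorial data). Calling mean() directly on the text version would return NA with a warning, so the values are converted with as.numeric() first.

```r
# Numbers stored as text have class "character", not "numeric"
prices <- c("4.50", "5.00", "3.80")   # hypothetical values for illustration
class(prices)

# Convert to numeric before doing arithmetic
prices_num <- as.numeric(prices)
class(prices_num)
mean(prices_num)
```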

table() displays the number of times each unique element of a column occurs.

table(Birthdays$Months)

R’s Built in dataframes

R also has a selection of built-in data frames for use in testing code and data analysis methods, and (luckily for us) for demonstrating these in R.

In this section, we simply show a sample of these built-in data sets, which you may see throughout this tutorial series. For more information on each of these data sets, you can also check out the help file via help(data-set-name).

Edgar Anderson’s Iris Data (iris)

From the help file on this data: “This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.”

# what columns does it have?
names(iris)
# what are the dimensions? 
dim(iris)
# what does the table structure look like?
head(iris)
# the class?
class(iris)

Motor Trend Car Road Tests (mtcars)

From the help file on this data: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).”

# what columns does it have?
names(mtcars)
# what are the dimensions? 
dim(mtcars)
# what does the table structure look like?
head(mtcars)
# the class?
class(mtcars)

Starwars Characters (starwars)

This data set ships with the dplyr package, so dplyr needs to be loaded before it can be used. From the help file on this data: “The original data, from SWAPI, the Star Wars API, https://swapi.dev/, has been revised to reflect additional research into gender and sex determinations of characters.”

# what columns does it have?
names(starwars)
# what are the dimensions? 
dim(starwars)
# what does the table structure look like?
head(starwars)
# the class?
class(starwars)

Growth of Loblolly Pine Trees (Loblolly)

From the help file on the data: “The Loblolly data frame has 84 rows and 3 columns of records of the growth of Loblolly pine trees.”

# what columns does it have?
names(Loblolly)
# what are the dimensions? 
dim(Loblolly)
# what does the table structure look like?
head(Loblolly)
# the class?
class(Loblolly)

Gapminder Data (gapminder)

Sometimes, we use datasets that exist within a package in R. One such package is the gapminder package. To use the gapminder dataset from this package, we must first load the package, and then the dataset can be used like any other built-in dataset. From the help file: “Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.”

library(gapminder)
# What columns does it have? 
names(gapminder)
# What are the dimensions?
dim(gapminder)
# What does the table structure look like? 
head(gapminder)
# What is the class of the data object?
class(gapminder)

Importing dataframes with read_csv()

For these tutorials, we will primarily be using built-in data frames, due to the tutorials being hosted online (that is, not locally on your machine). However, it is important to know that you can easily import data in .csv format from your local machine into RStudio should you need it.

This is done using the readr package and the read_csv() function like so… Say you have a file titled ecology.csv in your working directory, then…

library(readr)
ecology <- # need to give it a name so that it stores 
  read_csv("ecology.csv")

This will be necessary any time you wish to use external data inside your RStudio session, but don’t forget to set your working directory (see tutorial 1).
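The import step can be tried without an existing file. The sketch below is an illustration only: it writes a small made-up table to a temporary file (standing in for something like ecology.csv), checks the working directory with getwd(), and reads the file back with read_csv(). The object names csv_path and ecology_demo are hypothetical.

```r
library(readr)

# Check which folder R is currently looking in
getwd()

# Write a small made-up table to a temporary file so the example runs anywhere
# (in practice this would be a file such as ecology.csv that already exists)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(site = c("A", "B", "C"), count = c(12, 7, 30)),
  csv_path,
  row.names = FALSE
)

# Read it back in; show_col_types = FALSE silences the column-type message
ecology_demo <- read_csv(csv_path, show_col_types = FALSE)
ecology_demo
```

If the file lives outside the working directory, the full path to the file can be given to read_csv() instead of just the file name.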

Manipulating dataframes with dplyr

Introduction to dplyr

dplyr is a package that allows the user to manipulate data within R easily and readably. It does this by providing functions named after verbs, one for each of the main actions one can perform on a data frame.

Pulling out specific columns using pull()

Previously, in this tutorial series, you learned how to access individual columns of a data frame using the $ notation. dplyr has a function for this too, pull(), which returns the column as a vector. For factor columns, R also prints the levels underneath the values.

# pull('data frame name', 'column name') 
pull(mtcars, wt)

# Species is a factor, so R also prints its levels beneath the values
# more on this later 
pull(iris, Species)
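A quick check, sketched below, suggests that pull() and the $ notation return the same vector; the levels shown for Species come from the column being a factor rather than from pull() itself.

```r
library(dplyr)

# pull() and $ return the same vector
identical(pull(iris, Species), iris$Species)

# Species is a factor, so its levels are available either way
levels(pull(iris, Species))
levels(iris$Species)
```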

Exercise

Use the mtcars data set and the dplyr package to pull out each of the wt, disp and cyl columns individually.

Choosing specific columns: select()

select() outputs one or more specified columns of a data frame.

select(
  # the first input should always be the data frame in question 
  mtcars,
  # and then any columns you wish to output from this data frame 
  mpg) 

# any number of columns can be selected this way 
select(mtcars, 
       mpg, wt, cyl) # Shows the mpg, wt and cyl columns of the data frame

The output of this function can be saved as a new data frame in the usual way, using ‘<-’ and a new name; let’s use EfficiencyMeasures.

EfficiencyMeasures <- select(mtcars, wt, cyl)

Exercise

Use the dplyr package and the iris data set to select all columns of data relating to the Species and Petal.

Ordering rows: arrange()

arrange() outputs the data frame with its rows sorted in either alphabetical or numeric order based on the columns selected. One or more columns can be selected; the order in which they are given dictates the order in which they are used for sorting.

# let's arrange by cylinder value
arrange(mtcars, cyl)

# we can use nested functions also (i.e. functions inside other functions)
# for example, say we want to order by the rownames in mtcars 
arrange(mtcars, rownames(mtcars))

# we can even arrange by multiple variables 
# let's see how this looks when applied to the head() of the iris data 
arrange(head(iris), Sepal.Length, Sepal.Width)

If a column name is wrapped in the desc() function inside arrange(), then R will sort by that column in descending (reverse) order.
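As a minimal sketch of desc() inside arrange(), using the mtcars data (the object names by_mpg and by_cyl_mpg are just for illustration):

```r
library(dplyr)

# Sort cars from highest to lowest fuel efficiency (miles per gallon)
by_mpg <- arrange(mtcars, desc(mpg))
head(by_mpg, 3)

# desc() can be mixed with ordinary columns:
# ascending by cylinders, then descending by mpg within each cylinder count
by_cyl_mpg <- arrange(mtcars, cyl, desc(mpg))
head(by_cyl_mpg, 3)
```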

Exercise

Take a look at the starwars data set, and see if you can arrange the data in reverse alphabetical order by name.

Show data satisfying certain criteria: filter()

filter() allows us to limit our data frame to the rows that satisfy a condition, built by applying the logical operators covered earlier in the series to specific columns. For example, to filter the Birthdays data frame to only show those born in May we can use the ‘==’ operator.
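The May example can be sketched as follows. The Birthdays data frame is rebuilt here (without the Spent and Orders columns) so that the code runs on its own; may_birthdays is a name chosen for illustration.

```r
library(dplyr)

# Rebuild the Birthdays data frame from earlier so this runs on its own
Birthdays <- data.frame(
  Names  = c("Rachel", "Rita", "Dottie", "Chien-Shiung", "Gertrude"),
  Months = c("May", "April", "May", "May", "January"),
  Days   = c(27, 22, 12, 31, 23)
)

# Keep only the rows where Months is exactly "May"
may_birthdays <- filter(Birthdays, Months == "May")
may_birthdays
```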

Time to dive into the starwars data set. Let’s filter out any star wars characters that don’t have brown hair…

filter(starwars, hair_color == "brown")

You may have noticed in the built-in data sets section, in the head() of the starwars data, that Owen Lars’ hair colour is listed as "brown, grey" but this character is not in our filtered list. Here we can get a little more tricky, and use the stringr package to construct our condition.

We want to keep any data frame entries where the word brown is detected in the hair_color column.

The function str_detect() returns a TRUE or FALSE value when a “pattern” string is detected inside another string, in this case, the strings making up the values inside the hair_color column. Don’t forget to load the new package.

filter(starwars, str_detect(hair_color, "brown"))

To limit this data frame in a different way we can use our boolean operations from earlier in the series. For example, we might want to look only at the star wars characters who are taller than average for the universe.

filter(starwars, height >= mean(height))

# weird, this came out empty
# height contains missing values, so mean(height) is NA and no row passes the test
# what if we tell R to remove any NA values from the calculation 
filter(starwars, height >= mean(height, na.rm = TRUE))
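The empty result can be reproduced on a tiny vector. This is a sketch with made-up heights, one of them missing, showing how a single NA affects mean() and any comparison built on it.

```r
heights <- c(172, NA, 96)   # made-up values, one of them missing

mean(heights)                 # NA: one missing value makes the whole mean missing
mean(heights, na.rm = TRUE)   # 134: the NA is dropped before averaging

# Comparing against NA gives NA for every element,
# and filter() keeps only rows where the condition is TRUE
heights >= mean(heights)
heights >= mean(heights, na.rm = TRUE)
```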

But if we want to limit it to those who have brown hair AND are above average in height, we would use the ‘&’ operator.

filter(starwars, str_detect(hair_color, "brown") & height >= mean(height, na.rm = TRUE))

Changing or creating a column: mutate()

mutate() allows another column to be added to the data frame. It is unique in that it allows this to be done using specific column information.

Let’s add another column, called MonthCat, to the airquality data frame. In the airquality data set (another of R’s built-in data frames) we can see that the column labelled Month gives the month as a numerical value. We would like to create MonthCat, which will contain the months as their abbreviated names, e.g. 5 = May. To do this we need to borrow a function, month(), from another package called lubridate (a package for working with dates).

This data set is quite large, so I’m going to store the result as a separate object and then take a look at the head() to check that it worked.

airqualityNew <- mutate(
                      airquality, # the dataframe we wish to modify
                      # the month() function takes an extra argument `label` which tells it to give us the names
                      MonthCat = lubridate::month(Month, label = TRUE) # the function month() takes input of the Month column
                      )

head(airqualityNew, 10)

We can also use mutate to do simple arithmetic manipulations and much more complex things on columns in the data frame.

# mutate('DataFrameName', 'NewColumnName' = height column divided by 100)
mutate(starwars, heightm = height/100)

# because of how it prints, we cannot see the values in our resultant dataframe
# let's try it again and select the columns that are most relevant for our check 
mutate(starwars, heightm = height/100) %>% 
  select(name, height, heightm)

Maybe we don’t need two height columns in our data frame, and instead of creating a new column for our height in metres, we can simply override the original column.

starwarsMetres <- mutate(starwars, height = height/100) 

starwarsMetres %>% select(name, height)

# verify that we only have one height column in our dataframe 
names(starwarsMetres)

Exercise mutate()

The starwars characters are getting a routine health check up. The doctor knows their height and weight (stored in the mass column, in kilograms) but would like to have their BMI on record also. The formula for this is:

\[ \text{BMI} = \frac{\text{weight (kg)}}{(\text{height (m)})^2} \]

Your task is to add this value into the starwars data frame using the mutate() function.

# recall what we named the data where height was converted to metres

Telling R to pay attention to certain groupings: group_by()

group_by() changes the structure of a data frame so that observations are grouped together according to the values in one or more columns. This can be used in our example data frame starwars to group the characters by the column species. Assigning the result back to the original name with ‘<-’ overwrites our original data frame. If the original needs to be retained we can simply save the result under a new name, as below.

starwars.species <- group_by(starwars, species)
starwars.species

The group_by() function does not change how the data frame appears to us, but it does change the results of other functions on the data frame to take into account the grouping. One example is with the summarise() function described below.
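One way to see what group_by() has changed is sketched below, using dplyr's is_grouped_df(), group_vars() and ungroup(); the name starwars_by_species is chosen for illustration.

```r
library(dplyr)

starwars_by_species <- group_by(starwars, species)

# The rows and columns are unchanged...
dim(starwars_by_species)
dim(starwars)

# ...but the grouping is recorded on the data frame
is_grouped_df(starwars_by_species)
group_vars(starwars_by_species)

# ungroup() removes the grouping again
is_grouped_df(ungroup(starwars_by_species))
```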

Creating summaries: summarise()

summarise() is a function used most often in tandem with group_by(), see above. It allows you to summarise data using functions keeping to whatever grouping structure a data frame has. Above we grouped our starwars data frame by species so our summary of data will be for each unique species.

# summarise('NameDataFrame', 'New Variable Name' = mean of height column)
summarise(starwars.species, Av_height = mean(height, na.rm = TRUE))

Compare this to the same command on the original data frame without the group_by() applied:

# summarise('NameDataFrame', 'New Variable Name' = mean of height column)
summarise(starwars, Av_height = mean(height, na.rm = TRUE))

It is important to remember that each of these outputs can be saved in R by simply adding a unique name and <- in front of the function. If you wish to overwrite your original data frame simply place its name before <- and the function code.

Exercise

Consider the gapminder data set from the gapminder package. Use your newfound knowledge of the group_by() and summarise() combination to find the average life expectancy per continent.

library(gapminder)

Counting entries per group with count()

The count() function does two steps at once, by grouping a data frame and then tallying up the number of observations that fall into each of the groups.

# count('NameDataFrame', 'GroupingVariable')
count(starwars, sex)

# your turn! Count how many characters there are with each eye colour 
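The "two steps at once" description can be checked by building the same tally by hand. This sketch uses the sex column again and dplyr's n() helper, which counts the rows in each group; the object names are illustrative.

```r
library(dplyr)

# One step with count()
sex_counts <- count(starwars, sex)

# The same tally built by hand: group, then count rows with n()
sex_counts_manual <- summarise(group_by(starwars, sex), n = n())

sex_counts
sex_counts_manual
```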

Transforming wide data to long data: pivot_longer()

Sometimes data comes in a format that is different to how we might like it. It helps to be able to switch between wide and long data frame formats. A wide data frame typically has a separate column for each variable, whereas a long data frame has one column containing the variable names and another column containing the corresponding values.

Let’s take a look at a couple of examples.

First we will create our own wide data frame…

# rnorm() is a function that draws random values from a standard normal distribution 
wide_df <- data.frame(day = 1:5, X1 = rnorm(5), X2 = rnorm(5))
wide_df

Let’s transform this into a long data frame using pivot_longer()

# pivot_longer() comes from the tidyr package, so load that first
library(tidyr)
# ID becomes the name of the column containing the variable names 
# the name of the value column defaults to "value" but this can be set using the values_to= argument 
tall_df <- pivot_longer(data = wide_df, cols = c("X1", "X2"), names_to = "ID")
tall_df
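To illustrate switching in both directions, here is a sketch that uses fixed values instead of rnorm() so the result is easy to check. It names the value column with values_to and then reverses the step with tidyr's pivot_wider(). The names wide_demo, long_demo, back_to_wide and measurement are made up for this example.

```r
library(tidyr)

# A small wide table with fixed values so the result is easy to check
wide_demo <- data.frame(day = 1:3,
                        X1 = c(0.5, 1.2, -0.3),
                        X2 = c(2.0, 0.1, 0.7))

# Wide to long, naming the value column explicitly with values_to
long_demo <- pivot_longer(wide_demo, cols = c("X1", "X2"),
                          names_to = "ID", values_to = "measurement")
long_demo

# Long back to wide: one column per ID again
back_to_wide <- pivot_wider(long_demo, names_from = ID, values_from = measurement)
back_to_wide
```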

Wrangling categorical variables using the forcats package

If we wish to treat a categorical variable in a data frame as an ordinal variable, it is best to convert it to a factor class variable. Factors are variables which have certain levels which then exist in some kind of order. The forcats package exists to help us wrangle variables of this type.
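As a brief base R sketch of what "levels in some kind of order" means, consider some hypothetical cup sizes (not part of the tutorial data). By default factor() orders the levels alphabetically; supplying the levels argument sets the order explicitly.

```r
# Hypothetical cup sizes, stored as plain text
sizes <- c("small", "large", "medium", "small")

# By default the levels are ordered alphabetically
sizes_fct <- factor(sizes)
levels(sizes_fct)

# Supplying levels fixes the order we actually want
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)
```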

Changing the baseline / order of categorical variables: fct_relevel()

The fct_relevel() function can be used to re-specify the order of factors (Note that the first level is usually considered the baseline, and this is often that which is alphabetically first). In the mtcars dataset, the default baseline for the am variable is the “0” category:

# convert the am variable to a factor using the factor() function 
mtcarsFct <- mutate(mtcars, am = factor(am))
# take a look at how it now outputs that column 
pull(mtcarsFct, am)

The code below replaces the am column with a re-levelled version of itself (none of the data is changed; the only change is what R considers the baseline category). Note that the baseline is now the category “1”.

mtcarsFct <- mutate(mtcarsFct, am = fct_relevel(am, "1", "0"))
# compare this to the previous output 
pull(mtcarsFct, am)

Changing the names of categorical variables: fct_recode()

The fct_recode() function can be used to rename some or all categories. The first argument is a data frame column that is a factor, and then each change follows with a comma in between.

mtcarsFct <- mutate(mtcarsFct,
                 am = fct_recode(am, "auto" = "0",
                                    "manual" = "1"))
# check how it looks now 
pull(mtcarsFct, am)

Recategorizing certain factors as “other”: fct_other()

The fct_other() function can be used to put data with certain category values into a single category. It is useful when one has few observations in each of a large number of categories.

The following code re-categorizes any data that doesn’t have the species “setosa” as “other”:
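A minimal sketch of that call, assuming the keep argument of fct_other() is the intended approach (species_lumped is a name chosen for illustration):

```r
library(forcats)

# Keep "setosa" as its own level; every other species becomes "other"
species_lumped <- fct_other(iris$Species, keep = "setosa", other_level = "other")

levels(species_lumped)
table(species_lumped)
```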

Alternatively, you can state exactly which categories you wish to amalgamate into the “other” category:

fct_other(iris$Species, drop = c("virginica", "versicolor"), other_level = "other")

Other Useful Things

The pipe operator: %>%

The pipe operator %>% takes what is on its left hand side, and uses it as the first argument for what is on its right hand side. It is an invaluable tool to make one’s R code more readable (both for the person writing it and others). The following example demonstrates three ways to summarise the iris data set by computing the mean petal width for each species.

# two lines, easy to follow, but a bit verbose
# (also requires creating the intermediate data frame `grouped_iris`)
grouped_iris <- group_by(iris, Species)
summarise(grouped_iris, meanWidth = mean(Petal.Width))

# all on one line, hard to follow
summarise(group_by(iris, Species), meanWidth = mean(Petal.Width))

# using pipes, much more elegant
iris %>% 
  group_by(Species) %>% 
  summarise(meanWidth = mean(Petal.Width))

Multiple functions at once using across() within other functions

An advanced feature is to use across() inside of functions such as summarise() to avoid having to run multiple lines. The part .names = "{.fn}_{.col}" tells R to name each new column starting with the function name, then an underscore, and then the name of the original column.

grouped_iris <- group_by(iris, Species)

# gives grouped mean and standard deviation
summarise(grouped_iris, mean.Petal.Width = mean(Petal.Width),
          sd.Petal.Width = sd(Petal.Width))
summarise(grouped_iris, mean.Petal.Length = mean(Petal.Length), 
          sd.Petal.Length = sd(Petal.Length))

# does the same as above, but puts them into a single table
summarise(grouped_iris, across(c(Petal.Width, Petal.Length), 
                               list(mean = mean, sd = sd), 
                               .names = "{.fn}_{.col}"))

Test Your Knowledge

You can use the R window below to help find the answers to some quiz questions.

Quiz
