Chapter 2 Manipulating data
The dplyr
package is the main component of the tidyverse for data maniuplation. You can load it either by loading the tidyverse package or by loading dplyr directly.
2.1 Example: manipulating flights data
We’ve already seen some of the basic functions in Chapter 1. Let’s look at a more elaborate example using a dataset from the nycflights13
package. This package contains information about all flights that departed from NYC (e.g., EWR, JFK and LGA) in 2013. The dataset is quite large and includes information on delays, airlines, airports, weather, and planes. It contains 336,776 rows and 19 columns.
We first load the nycflights13
package to get the flights
dataset.
## [1] 336776 19
For our analysis, we want to focus on flights that were less than 1000 miles (distance) and restrict the dataset to keep only the columns: dep_delay
, arr_delay
, origin
, dest
, air_time
, and distance
.
In addition, we convert the departure and arrival delays into hours, and calculate the average flight speed (in mph).
We also want to add a new column with the Z-score for departure delays.
Finally, we want to order by average flight speed (fastest to slowest) and return the first four rows.
This is the full data processing pipeline:
Code
df <- flights %>%
# step 1
filter(distance < 1000) %>%
# step 2
select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
# step 3
mutate(
Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
sd(dep_delay, na.rm=TRUE),
dep_delay = dep_delay / 60,
arr_delay = arr_delay / 60,
speed = distance / (air_time / 60)
) %>%
# step 4
arrange(-speed) %>%
# step 5
print(n=4)
## # A tibble: 189,671 × 8
## dep_delay arr_delay origin dest air_time distance Z_dep_delay speed
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0.15 -0.233 LGA ATL 65 762 -0.108 703.
## 2 0.25 -0.0167 EWR GSP 55 594 0.0355 648
## 3 0.0667 0.0333 EWR BNA 70 748 -0.228 641.
## 4 0.267 -0.367 EWR CVG 62 569 0.0595 551.
## # ℹ 189,667 more rows
Step 1: The filter
function affects the rows of the dataframe. Here, it keeps only flights with a distance
less than 1000 miles. The dplyr
package allows us to use the column names without quotes and without the table name.
Step 2: select
affects the columns of the dataframe. Here, it keeps the listed six columns. The select statement accepts a number of helper functions to select columns based on their names. For example, starts_with("delay")
selects all columns that start with the word “delay”. The :
operator is used to select a range of columns. Using this functionality, we could have written the select
statement also as:
select(ends_with("delay"), origin:distance) %>%
Step 3: This steps modifies columns and adds new ones. The first assignment, creates the new column Z_dep_delay
which is the Z-score of the departure delay. The second and third assignments convert the departure and arrival delays from minutes to hours. The fourth assignment calculates the average flight speed in miles per hour.
Step 4: The arrange
function is used to order the rows by speed. The -
sign is used to indicate descending order.
Step 5: Finally, we print the first 4 rows. This serves only for information and doesn’t change the tibble
which is assigned to the variable df
.
2.2 Overview of dplyr
functionality
We’ve seen some of the basic functions in the example above. In most cases, these will be sufficient for your data manipulation needs. It is however useful to have an understanding of the whole package.
Todo:
Visit the dplyr
website (https://dplyr.tidyverse.org/) and read the Get started section. This will give you a good overview of the package.
The functionality can be organized into several categories.
2.2.1 Sorting data
The function arrange()
is used to sort the data. Here are a few examples:
## # A tibble: 6 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 1 5
## 3 2 5
## 4 2 6
## 5 3 4
## 6 3 6
## # A tibble: 6 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 3 4
## 3 1 5
## 4 2 5
## 5 2 6
## 6 3 6
## # A tibble: 6 × 2
## a b
## <dbl> <dbl>
## 1 2 6
## 2 3 6
## 3 1 5
## 4 2 5
## 5 1 4
## 6 3 4
## # A tibble: 6 × 2
## a b
## <dbl> <dbl>
## 1 3 4
## 2 3 6
## 3 2 5
## 4 2 6
## 5 1 4
## 6 1 5
The examples show that we can sort by one or more columns. The default is ascending order. To sort in descending order, we can use the desc()
function or the -
sign. Multiple columns are sorted in the order they are listed. For example, -a, b
first sorts by column a
in descending order, then for rows with the same value in column a
, it sorts by column b
in ascending order.
2.2.2 Selecting columns
The function select()
is used to select columns. It can be used to keep only certain columns or to drop columns. The select()
function accepts a number of helper functions to select columns based on their names. Here are a few examples:
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
## [1] "arr_time" "arr_delay"
## [1] "dep_time" "sched_dep_time" "arr_time" "sched_arr_time"
## [5] "air_time" "time_hour"
## [1] "dep_delay" "arr_delay"
## [1] "dep_time" "sched_dep_time" "arr_time" "sched_arr_time"
## [5] "air_time" "hour" "minute" "time_hour"
Code
## [1] "dep_delay" "arr_delay" "origin" "dest" "air_time" "distance"
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "origin" "dest" "air_time"
## [13] "distance" "hour" "minute" "time_hour"
## [1] "year" "month" "day" "dep_delay" "arr_delay" "carrier"
## [7] "flight" "tailnum" "origin" "dest" "distance"
2.2.3 Filtering rows
The function filter()
is used to filter rows. It keeps only rows that satisfy the condition. Here are a few examples:
Code
## # A tibble: 3 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 4 1 4
## 3 6 1 6
## # A tibble: 2 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 4 1 4
## # A tibble: 4 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 3 3 6
## 3 4 1 4
## 4 5 3 5
## # A tibble: 5 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 3 3 6
## 3 4 1 4
## 4 5 3 5
## 5 6 1 6
## # A tibble: 2 × 3
## row a b
## <int> <dbl> <dbl>
## 1 2 2 5
## 2 7 NA 8
## # A tibble: 1 × 3
## row a b
## <int> <dbl> <dbl>
## 1 7 NA 8
distinct
removes duplicate rows:
## # A tibble: 6 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 2 2 5
## 3 3 3 6
## 4 5 3 5
## 5 6 1 6
## 6 7 NA 8
It’s also possible to select rows using the row number. The function slice()
is used for this purpose.
## # A tibble: 3 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 2 2 5
## 3 3 3 6
## # A tibble: 3 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 3 3 6
## 3 5 3 5
## # A tibble: 3 × 3
## row a b
## <int> <dbl> <dbl>
## 1 2 2 5
## 2 4 1 4
## 3 6 1 6
slice_min
and slice_max
are useful if you want to identify the rows with the smallest or largest values in a column.
## # A tibble: 3 × 3
## row a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 4 1 4
## 3 6 1 6
## # A tibble: 2 × 3
## row a b
## <int> <dbl> <dbl>
## 1 3 3 6
## 2 5 3 5
slice_sample
selects a random sample of rows.
## # A tibble: 5 × 3
## row a b
## <int> <dbl> <dbl>
## 1 4 1 4
## 2 6 1 6
## 3 1 1 4
## 4 7 NA 8
## 5 5 3 5
## # A tibble: 5 × 3
## row a b
## <int> <dbl> <dbl>
## 1 4 1 4
## 2 6 1 6
## 3 4 1 4
## 4 7 NA 8
## 5 2 2 5
2.2.4 Modifying tables
We’ve already seen the mutate
function that is used to add new columns or modify existing ones.
Code
## # A tibble: 3 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 4 5
## 2 2 5 7
## 3 3 6 9
## # A tibble: 3 × 5
## a b c d e
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 5 -3 -15
## 2 2 5 7 -3 -21
## 3 3 6 9 -3 -27
The last example shows that the new columns (e
) can be based on previously created columns (c
and d
).
## # A tibble: 3 × 2
## a b
## <dbl> <dbl>
## 1 2 4
## 2 3 5
## 3 4 6
If a column name already exists, the original column will be replaced. To avoid this, use the add_column
function. It will throw an error if the column name exists. The add_column
function can also be used to add a column in a specific position.
## # A tibble: 3 × 3
## a b idx
## <dbl> <dbl> <int>
## 1 1 4 1
## 2 2 5 2
## 3 3 6 3
## # A tibble: 3 × 3
## idx a b
## <int> <dbl> <dbl>
## 1 1 1 4
## 2 2 2 5
## 3 3 3 6
add_row
adds new rows to the table. We can again use .before
and .after
to specify the position.
## # A tibble: 4 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 5
## 3 3 6
## 4 4 7
## # A tibble: 4 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 4 7
## 3 2 5
## 4 3 6
2.3 Handling missing values
Real life data are often messy and contain missing values. The tidyverse packages dplyr
and tidyr
have functions that help dealing with this. In datasets, missing values are not always identified using NA. Sometimes, they are identified using a placeholder value, e.g. -9999.
2.3.1 Convert placeholders to missing values
Here is an example of a dataset with placeholder values that get converted to missing values.
Code
## # A tibble: 4 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 NA
## 3 3 6
## 4 NA 8
2.3.2 Remove or replace missing values
The tidyr
package has quite a few functions for handling missing values.
We already encountered the drop_na
function to remove rows with missing values.
Code
## # A tibble: 2 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 3 6
## # A tibble: 3 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 NA
## 3 3 6
The replace_na
function replaces missing values with a specified value.
## # A tibble: 4 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 99
## 3 3 6
## 4 0 8
To replace the missing values with the mean of each column, we can use:
## # A tibble: 4 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 6
## 3 3 6
## 4 2 8
2.4 Split - Apply - Combine
The dplyr
operations are even more powerful when they can be used with grouping variables. This approach to data processing is also known as Split - Apply - Combine (Wickham 2011).
Split
: The data is split into groups based on one or more grouping variables.Apply
: A function is applied to each group independently.Combine
: The results are combined into a new data structure.
2.4.1 Two examples
Let’s see this approach in action using the flights
dataset. We want to calculate the average departure delay for each airline and sort the resulting table by the average delay.
Code
## # A tibble: 16 × 2
## carrier avg_dep_delay
## <chr> <dbl>
## 1 US 3.78
## 2 HA 4.90
## 3 AS 5.80
## 4 AA 8.59
## 5 DL 9.26
## 6 MQ 10.6
## 7 UA 12.1
## 8 OO 12.6
## 9 VX 12.9
## 10 B6 13.0
## 11 9E 16.7
## 12 WN 17.7
## 13 FL 18.7
## 14 YV 19.0
## 15 EV 20.0
## 16 F9 20.2
Step 1: The group_by
function is used to specify the grouping variable(s). In this case, we group by the carrier
variable. Note: grouping should to be applied on discrete variables (categorical, factor, or maybe integer valued columns).
Step 2: This step is applied to the individual groups. The summarize
function is used to calculate the average departure delay for each group. The na.rm=TRUE
argument is used to ignore missing values. Following the summarization, the data is combined into a single tibble. It contains only the grouping variable(s) and the new summary variable(s).
Step 3: We can now continue with normal processing. The arrange
function is used to sort the resulting table by the average delay.
We can group by multiple variables. For example,
Code
## # A tibble: 224 × 6
## # Groups: origin, dest [224]
## origin dest max.delay avg.delay min.delay count
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 EWR ALB 328 14.4 -34 439
## 2 EWR ANC 39 -2.5 -47 8
## 3 EWR ATL 796 13.2 -39 5022
## 4 EWR AUS 349 -0.474 -59 968
## 5 EWR AVL 228 8.80 -26 265
## 6 EWR BDL 266 7.05 -43 443
## 7 EWR BNA 364 12.7 -41 2336
## 8 EWR BOS 422 4.78 -47 5327
## 9 EWR BQN 208 10.9 -43 297
## 10 EWR BTV 306 12.2 -41 931
## # ℹ 214 more rows
Useful to know:
Instead of first grouping the data and then summarizing, we can use the .by
argument int the summarize function. This is more efficient and can be used for simple summaries.
Code
## # A tibble: 224 × 6
## origin dest max.delay avg.delay min.delay count
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 EWR IAH 374 5.41 -63 3973
## 2 LGA IAH 435 1.45 -59 2951
## 3 JFK MIA 614 -1.99 -64 3314
## 4 JFK BQN 183 6.94 -32 599
## 5 LGA ATL 895 11.3 -49 10263
## 6 EWR ORD 1109 9.00 -59 6100
## 7 EWR FLL 375 7.76 -56 3793
## 8 LGA IAD 398 13.7 -47 1803
## 9 JFK MCO 329 5.30 -63 5464
## 10 LGA ORD 448 1.83 -62 8857
## # ℹ 214 more rows
2.4.2 Applying functions to groups adding to the original table
The mutate
function can be used to add new columns to the original table but calculated for each group individually. For example,
## # A tibble: 4 × 3
## a b mean_a
## <dbl> <dbl> <dbl>
## 1 1 4 1
## 2 2 NA 2
## 3 3 6 3
## 4 NA 8 NA
Step 1: Group the data using the column b
Step 2: Calculate the mean of column a
for each group. The mutate
function applies in this case only to the group and not the full table.
Step 3: To restore the full table, use ungroup
. This removes the grouping.
2.4.3 Shortcuts
Some Split - Apply - Combine steps occur so frequently that dplyr
provides shortcuts for them.
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
is equivalent to
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
2.5 Concatenate and join data
2.5.1 Concatenating data
Concatenating data is the process of combining two or more datasets into a single one. The dplyr
package provides the bind_rows
and bind_cols
functions for this purpose. The bind_rows
function is used to combine rows from two datasets. The bind_cols
function is used to combine columns from two datasets.
The dplyr::bind_rows
function combines rows from two or more datasets. This is useful if your dataset is split into parts and you need to combine them.
Code
## # A tibble: 6 × 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 4 NA
## 2 2 5 NA
## 3 3 6 NA
## 4 4 7 10
## 5 5 8 11
## 6 6 9 12
Columns are matched by name. If a data frame has missing columns, the missing columns are filled with NA
.
The dplyr::bind_cols
function combines columns from two or more datasets. This is useful if you want to add columns to an existing dataset.
## # A tibble: 3 × 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4 4 7
## 2 2 5 5 8
## 3 3 6 6 9
The rows are combined in the order they appear, so be careful that the rows match. If you have a common identifier in the data frames, using and of the join functions is a safer option.
If columns occur in both dataframes, the column names will be adjusted to be unique. In the following example, the a
columns are renamed to a...1
and a...3
.
## New names:
## • `a` -> `a...1`
## • `a` -> `a...3`
## # A tibble: 3 × 4
## a...1 b a...3 c
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4 4 7
## 2 2 5 5 8
## 3 3 6 6 9
2.5.2 Joins
Joins are used to combine or merge two datasets, based on common identfiiers. The dplyr
package provides a number of functions for this purpose. The main functions are inner_join
, left_join
, right_join
, and full_join
. The inner_join
function is the most commonly used one. It keeps only rows that have matching values in both datasets. The left_join
function keeps all rows from the first dataset and adds columns from the second dataset if there is a match. The right_join
function is the reverse of left_join
. The full_join
function keeps all rows from both datasets. We will not cover this here but you can read more about joins in the R4DS book.
2.6 Additional tidyr functions
The tidyr
package has a few more functions that are useful for data manipulation and cleanup. We list these here for completeness.
pivot_wider()/spread()
: Spreads a pair of key:value columns into a set of tidy columnspivot_longer()/gather()
: Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You usepivot_longer()/gather()
when you notice that you have columns that are not variablesseparate()
turns a single character column into multiple columnsunite()
paste together multiple columns into one (reverse ofseparate()
)
Further information:
The dplyr cheatsheet is a two-page summary of all the main features of dplyr. For more details about dplyr, see the main website at https://dplyr.tidyverse.org/.
Code
The code of this chapter is summarized here.
Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse) # or library(dplyr)
library(nycflights13) # load flight data
dim(flights)
df <- flights %>%
# step 1
filter(distance < 1000) %>%
# step 2
select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
# step 3
mutate(
Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
sd(dep_delay, na.rm=TRUE),
dep_delay = dep_delay / 60,
arr_delay = arr_delay / 60,
speed = distance / (air_time / 60)
) %>%
# step 4
arrange(-speed) %>%
# step 5
print(n=4)
df <- tibble(a = c(1, 3, 2, 1, 3, 2), b = c(4, 4, 5, 5, 6, 6))
df %>% arrange(a)
df %>% arrange(b, a)
df %>% arrange(desc(b), a)
df %>% arrange(-a, b)
flights %>%
colnames()
# select by part of name:
flights %>%
select(starts_with("arr")) %>%
colnames()
flights %>%
select(contains("time")) %>%
colnames()
flights %>%
select(ends_with("delay")) %>%
colnames()
# using regular expressions
flights %>%
select(matches("(time|hour|minute)")) %>%
colnames()
# select columns using a vector of column names (all_of or any_of)
variables <- c("dep_delay", "arr_delay", "origin", "dest",
"air_time", "distance")
flights %>%
select(all_of(variables)) %>%
colnames()
# remove columns by name:
flights %>%
select(-c(carrier, flight, tailnum)) %>%
colnames()
flights %>%
select(! matches("(time|hour|minute)")) %>%
colnames()
df <- tibble(
row = c(1:7),
a = c(1, 2, 3, 1, 3, 1, NA),
b = c(4, 5, 6, 4, 5, 6, 8)
)
# filter by column values
df %>%
filter(a == 1)
df %>%
filter(a == 1 & b == 4)
df %>%
filter(a > 2 | b == 4)
df %>%
filter(a %in% c(1, 3))
# remove by volumn values - use the ! operator
df %>%
filter(! a %in% c(1, 3))
# filter to missing values in column
df %>%
filter(is.na(a))
df %>%
distinct(a, b, .keep_all = TRUE)
df %>%
slice(1:3)
df %>%
slice(1, 3, 5)
idx <- c(2, 4, 6)
df %>%
slice(idx)
df %>%
slice_min(a)
df %>%
slice_max(a, n=2)
df %>%
slice_sample(n=5) # sample without replacement
df %>%
slice_sample(n=5, replace=TRUE) # sample with replacement
df <- tibble(
a = c(1, 2, 3),
b = c(4, 5, 6)
)
# Add one or more new columns
df %>%
mutate(c = a + b)
df %>%
mutate(c = a + b, d = a - b, e = c * d)
# Modify existing columns
df %>%
mutate(a = a + 1)
df %>%
add_column(idx = seq_len(nrow(df)))
df %>%
add_column(idx = seq_len(nrow(df)), .before="a")
df %>%
add_row(a=4, b=7)
df %>%
add_row(a=4, b=7, .after=1)
df <- tibble(
a = c(1, 2, 3, -9999),
b = c(4, -9999, 6, 8)
)
df <- df %>%
mutate_all(na_if, -9999) %>%
print()
df <- tibble(
a = c(1, 2, 3, NA),
b = c(4, NA, 6, 8)
)
df %>%
drop_na() # drop rows with missing values
df %>%
drop_na(a) # drop rows in which column "a" has missing values
df %>%
replace_na(list(a = 0, b = 99))
df %>%
replace_na(as.list(colMeans(df, na.rm=TRUE)))
flights %>%
group_by(carrier) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE)) %>%
arrange(avg_dep_delay)
flights %>%
group_by(origin, dest) %>% # group by both origin and dest
summarise(max.delay = max(arr_delay, na.rm=TRUE),
avg.delay = mean(arr_delay, na.rm=TRUE),
min.delay = min(arr_delay, na.rm=TRUE),
count = n(), # n() gives the group count
.groups = "keep") # keep information about the grouping
flights %>%
summarize(max.delay = max(arr_delay, na.rm=TRUE),
avg.delay = mean(arr_delay, na.rm=TRUE),
min.delay = min(arr_delay, na.rm=TRUE),
count = n(),
.by = c(origin, dest))
df %>%
group_by(b) %>%
mutate(mean_a = mean(a)) %>%
ungroup()
flights %>%
count(origin)
flights %>%
group_by(origin) %>%
summarize(n=n())
df1 <- tibble(a=c(1, 2, 3), b=c(4, 5, 6))
df2 <- tibble(a=c(4, 5, 6), b=c(7, 8, 9), c=c(10, 11, 12))
# bind rows
bind_rows(df1, df2)
df3 <- tibble(c = c(4, 5, 6), d = c(7, 8, 9))
bind_cols(df1, df3)
df4 <- tibble(a = c(4, 5, 6), c = c(7, 8, 9))
bind_cols(df1, df4)