Chapter 2 Manipulating data

The dplyr package is the main component of the tidyverse for data maniuplation. You can load it either by loading the tidyverse package or by loading dplyr directly.

Code

library(tidyverse)    # or library(dplyr)

2.1 Example: manipulating flights data

We’ve already seen some of the basic functions in Chapter 1. Let’s look at a more elaborate example using a dataset from the nycflights13 package. This package contains information about all flights that departed from NYC (e.g., EWR, JFK and LGA) in 2013. The dataset is quite large and includes information on delays, airlines, airports, weather, and planes. It contains 336,776 rows and 19 columns.

We first load the nycflights13 package to get the flights dataset.

Code

library(nycflights13) # load flight data
dim(flights)

## [1] 336776     19

For our analysis, we want to focus on flights that were less than 1000 miles (distance) and restrict the dataset to keep only the columns: dep_delay, arr_delay, origin, dest, air_time, and distance. In addition, we convert the departure and arrival delays into hours, and calculate the average flight speed (in mph). We also want to add a new column with the Z-score for departure delays. Finally, we want to order by average flight speed (fastest to slowest) and return the first four rows.

This is the full data processing pipeline:

Code

df <- flights %>%
    # step 1
    filter(distance < 1000) %>%
    # step 2
    select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
    # step 3
    mutate(
        Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
                      sd(dep_delay, na.rm=TRUE),
        dep_delay = dep_delay / 60,
        arr_delay = arr_delay / 60,
        speed = distance / (air_time / 60)
    ) %>%
    # step 4
    arrange(-speed) %>%
    # step 5
    print(n=4)

## # A tibble: 189,671 × 8
##   dep_delay arr_delay origin dest  air_time distance Z_dep_delay speed
##       <dbl>     <dbl> <chr>  <chr>    <dbl>    <dbl>       <dbl> <dbl>
## 1    0.15     -0.233  LGA    ATL         65      762     -0.108   703.
## 2    0.25     -0.0167 EWR    GSP         55      594      0.0355  648 
## 3    0.0667    0.0333 EWR    BNA         70      748     -0.228   641.
## 4    0.267    -0.367  EWR    CVG         62      569      0.0595  551.
## # ℹ 189,667 more rows

Step 1: The filter function affects the rows of the dataframe. Here, it keeps only flights with a distance less than 1000 miles. The dplyr package allows us to use the column names without quotes and without the table name.

Step 2: select affects the columns of the dataframe. Here, it keeps the listed six columns. The select statement accepts a number of helper functions to select columns based on their names. For example, starts_with("delay") selects all columns that start with the word “delay”. The : operator is used to select a range of columns. Using this functionality, we could have written the select statement also as:

    select(ends_with("delay"), origin:distance) %>%

Step 3: This steps modifies columns and adds new ones. The first assignment, creates the new column Z_dep_delay which is the Z-score of the departure delay. The second and third assignments convert the departure and arrival delays from minutes to hours. The fourth assignment calculates the average flight speed in miles per hour.

Step 4: The arrange function is used to order the rows by speed. The - sign is used to indicate descending order.

Step 5: Finally, we print the first 4 rows. This serves only for information and doesn’t change the tibble which is assigned to the variable df.

2.2 Overview of `dplyr` functionality

We’ve seen some of the basic functions in the example above. In most cases, these will be sufficient for your data manipulation needs. It is however useful to have an understanding of the whole package.

Todo:

Visit the dplyr website (https://dplyr.tidyverse.org/) and read the Get started section. This will give you a good overview of the package.

The functionality can be organized into several categories.

2.2.1 Sorting data

The function arrange() is used to sort the data. Here are a few examples:

Code

df <- tibble(a = c(1, 3, 2, 1, 3, 2), b = c(4, 4, 5, 5, 6, 6))
df %>% arrange(a)

## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     1     5
## 3     2     5
## 4     2     6
## 5     3     4
## 6     3     6

Code

df %>% arrange(b, a)

## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     3     4
## 3     1     5
## 4     2     5
## 5     2     6
## 6     3     6

Code

df %>% arrange(desc(b), a)

## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     2     6
## 2     3     6
## 3     1     5
## 4     2     5
## 5     1     4
## 6     3     4

Code

df %>% arrange(-a, b)

## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     3     4
## 2     3     6
## 3     2     5
## 4     2     6
## 5     1     4
## 6     1     5

The examples show that we can sort by one or more columns. The default is ascending order. To sort in descending order, we can use the desc() function or the - sign. Multiple columns are sorted in the order they are listed. For example, -a, b first sorts by column a in descending order, then for rows with the same value in column a, it sorts by column b in ascending order.

2.2.2 Selecting columns

The function select() is used to select columns. It can be used to keep only certain columns or to drop columns. The select() function accepts a number of helper functions to select columns based on their names. Here are a few examples:

Code

flights %>%
    colnames()

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Code

# select by part of name:
flights %>%
    select(starts_with("arr")) %>%
    colnames()

## [1] "arr_time"  "arr_delay"

Code

flights %>%
    select(contains("time")) %>%
    colnames()

## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "time_hour"

Code

flights %>%
    select(ends_with("delay")) %>%
    colnames()

## [1] "dep_delay" "arr_delay"

Code

# using regular expressions
flights %>%
    select(matches("(time|hour|minute)")) %>%
    colnames()

## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "hour"           "minute"         "time_hour"

Code

# select columns using a vector of column names (all_of or any_of)
variables <- c("dep_delay", "arr_delay", "origin", "dest",
               "air_time", "distance")
flights %>%
    select(all_of(variables)) %>%
    colnames()

## [1] "dep_delay" "arr_delay" "origin"    "dest"      "air_time"  "distance"

Code

# remove columns by name:
flights %>%
    select(-c(carrier, flight, tailnum)) %>%
    colnames()

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "origin"         "dest"           "air_time"      
## [13] "distance"       "hour"           "minute"         "time_hour"

Code

flights %>%
    select(! matches("(time|hour|minute)")) %>%
    colnames()

##  [1] "year"      "month"     "day"       "dep_delay" "arr_delay" "carrier"  
##  [7] "flight"    "tailnum"   "origin"    "dest"      "distance"

2.2.3 Filtering rows

The function filter() is used to filter rows. It keeps only rows that satisfy the condition. Here are a few examples:

Code

df <- tibble(
    row = c(1:7),
    a = c(1, 2, 3, 1, 3, 1, NA),
    b = c(4, 5, 6, 4, 5, 6, 8)
)
# filter by column values
df %>%
    filter(a == 1)

## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4
## 3     6     1     6

Code

df %>%
    filter(a == 1 & b == 4)

## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4

Code

df %>%
    filter(a > 2 | b == 4)

## # A tibble: 4 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     4     1     4
## 4     5     3     5

Code

df %>%
    filter(a %in% c(1, 3))

## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     4     1     4
## 4     5     3     5
## 5     6     1     6

Code

# remove by volumn values - use the ! operator
df %>%
    filter(! a %in% c(1, 3))

## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     2     2     5
## 2     7    NA     8

Code

# filter to missing values in column
df %>%
    filter(is.na(a))

## # A tibble: 1 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     7    NA     8

distinct removes duplicate rows:

Code

df %>%
    distinct(a, b, .keep_all = TRUE)

## # A tibble: 6 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6
## 4     5     3     5
## 5     6     1     6
## 6     7    NA     8

It’s also possible to select rows using the row number. The function slice() is used for this purpose.

Code

df %>%
    slice(1:3)

## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6

Code

df %>%
    slice(1, 3, 5)

## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     5     3     5

Code

idx <- c(2, 4, 6)
df %>%
    slice(idx)

## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     2     2     5
## 2     4     1     4
## 3     6     1     6

slice_min and slice_max are useful if you want to identify the rows with the smallest or largest values in a column.

Code

df %>%
    slice_min(a)

## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4
## 3     6     1     6

Code

df %>%
    slice_max(a, n=2)

## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     3     3     6
## 2     5     3     5

slice_sample selects a random sample of rows.

Code

df %>%
    slice_sample(n=5)  # sample without replacement

## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     5     3     5
## 2     1     1     4
## 3     7    NA     8
## 4     3     3     6
## 5     4     1     4

Code

df %>%
    slice_sample(n=5, replace=TRUE)  # sample with replacement

## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     3     3     6
## 2     6     1     6
## 3     5     3     5
## 4     1     1     4
## 5     6     1     6

2.2.4 Modifying tables

We’ve already seen the mutate function that is used to add new columns or modify existing ones.

Code

df <- tibble(
    a = c(1, 2, 3),
    b = c(4, 5, 6)
)
# Add one or more new columns
df %>%
    mutate(c = a + b)

## # A tibble: 3 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     4     5
## 2     2     5     7
## 3     3     6     9

Code

df %>%
    mutate(c = a + b, d = a - b, e = c * d)

## # A tibble: 3 × 5
##       a     b     c     d     e
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -3   -15
## 2     2     5     7    -3   -21
## 3     3     6     9    -3   -27

The last example shows that the new columns (e) can be based on previously created columns (c and d).

Code

# Modify existing columns
df %>%
    mutate(a = a + 1)

## # A tibble: 3 × 2
##       a     b
##   <dbl> <dbl>
## 1     2     4
## 2     3     5
## 3     4     6

If a column name already exists, the original column will be replaced. To avoid this, use the add_column function. It will throw an error if the column name exists. The add_column function can also be used to add a column in a specific position.

Code

df %>%
    add_column(idx = seq_len(nrow(df)))

## # A tibble: 3 × 3
##       a     b   idx
##   <dbl> <dbl> <int>
## 1     1     4     1
## 2     2     5     2
## 3     3     6     3

Code

df %>%
    add_column(idx = seq_len(nrow(df)), .before="a")

## # A tibble: 3 × 3
##     idx     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6

add_row adds new rows to the table. We can again use .before and .after to specify the position.

Code

df %>%
    add_row(a=4, b=7)

## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2     5
## 3     3     6
## 4     4     7

Code

df %>%
    add_row(a=4, b=7, .after=1)

## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     4     7
## 3     2     5
## 4     3     6

2.3 Handling missing values

Real life data are often messy and contain missing values. The tidyverse packages dplyr and tidyr have functions that help dealing with this. In datasets, missing values are not always identified using NA. Sometimes, they are identified using a placeholder value, e.g. -9999.

2.3.1 Convert placeholders to missing values

Here is an example of a dataset with placeholder values that get converted to missing values.

Code

df <- tibble(
    a = c(1, 2, 3, -9999),
    b = c(4, -9999, 6, 8)
)
df <- df %>%
    mutate_all(na_if, -9999) %>%
    print()

## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    NA
## 3     3     6
## 4    NA     8

2.3.2 Remove or replace missing values

The tidyr package has quite a few functions for handling missing values. We already encountered the drop_na function to remove rows with missing values.

Code

df <- tibble(
    a = c(1, 2, 3, NA),
    b = c(4, NA, 6, 8)
)
df %>%
    drop_na()  # drop rows with missing values

## # A tibble: 2 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     3     6

Code

df %>%
    drop_na(a)  # drop rows in which column "a" has missing values

## # A tibble: 3 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    NA
## 3     3     6

The replace_na function replaces missing values with a specified value.

Code

df %>%
    replace_na(list(a = 0, b = 99))

## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    99
## 3     3     6
## 4     0     8

To replace the missing values with the mean of each column, we can use:

Code

df %>%
    replace_na(as.list(colMeans(df, na.rm=TRUE)))

## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2     6
## 3     3     6
## 4     2     8

2.4 Split - Apply - Combine

The dplyr operations are even more powerful when they can be used with grouping variables. This approach to data processing is also known as Split - Apply - Combine (Wickham 2011).

Split: The data is split into groups based on one or more grouping variables.
Apply: A function is applied to each group independently.
Combine: The results are combined into a new data structure.

2.4.1 Two examples

Let’s see this approach in action using the flights dataset. We want to calculate the average departure delay for each airline and sort the resulting table by the average delay.

Code

flights %>%
    group_by(carrier) %>%
    summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE)) %>%
    arrange(avg_dep_delay)

## # A tibble: 16 × 2
##    carrier avg_dep_delay
##    <chr>           <dbl>
##  1 US               3.78
##  2 HA               4.90
##  3 AS               5.80
##  4 AA               8.59
##  5 DL               9.26
##  6 MQ              10.6 
##  7 UA              12.1 
##  8 OO              12.6 
##  9 VX              12.9 
## 10 B6              13.0 
## 11 9E              16.7 
## 12 WN              17.7 
## 13 FL              18.7 
## 14 YV              19.0 
## 15 EV              20.0 
## 16 F9              20.2

Step 1: The group_by function is used to specify the grouping variable(s). In this case, we group by the carrier variable. Note: grouping should to be applied on discrete variables (categorical, factor, or maybe integer valued columns).

Step 2: This step is applied to the individual groups. The summarize function is used to calculate the average departure delay for each group. The na.rm=TRUE argument is used to ignore missing values. Following the summarization, the data is combined into a single tibble. It contains only the grouping variable(s) and the new summary variable(s).

Step 3: We can now continue with normal processing. The arrange function is used to sort the resulting table by the average delay.

We can group by multiple variables. For example,

Code

flights %>%
    group_by(origin, dest) %>%   # group by both origin and dest
    summarise(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n(),       # n() gives the group count
              .groups = "keep")  # keep information about the grouping

## # A tibble: 224 × 6
## # Groups:   origin, dest [224]
##    origin dest  max.delay avg.delay min.delay count
##    <chr>  <chr>     <dbl>     <dbl>     <dbl> <int>
##  1 EWR    ALB         328    14.4         -34   439
##  2 EWR    ANC          39    -2.5         -47     8
##  3 EWR    ATL         796    13.2         -39  5022
##  4 EWR    AUS         349    -0.474       -59   968
##  5 EWR    AVL         228     8.80        -26   265
##  6 EWR    BDL         266     7.05        -43   443
##  7 EWR    BNA         364    12.7         -41  2336
##  8 EWR    BOS         422     4.78        -47  5327
##  9 EWR    BQN         208    10.9         -43   297
## 10 EWR    BTV         306    12.2         -41   931
## # ℹ 214 more rows

Useful to know: Instead of first grouping the data and then summarizing, we can use the .by argument int the summarize function. This is more efficient and can be used for simple summaries.

Code

flights %>%
    summarize(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n(),
              .by = c(origin, dest))

## # A tibble: 224 × 6
##    origin dest  max.delay avg.delay min.delay count
##    <chr>  <chr>     <dbl>     <dbl>     <dbl> <int>
##  1 EWR    IAH         374      5.41       -63  3973
##  2 LGA    IAH         435      1.45       -59  2951
##  3 JFK    MIA         614     -1.99       -64  3314
##  4 JFK    BQN         183      6.94       -32   599
##  5 LGA    ATL         895     11.3        -49 10263
##  6 EWR    ORD        1109      9.00       -59  6100
##  7 EWR    FLL         375      7.76       -56  3793
##  8 LGA    IAD         398     13.7        -47  1803
##  9 JFK    MCO         329      5.30       -63  5464
## 10 LGA    ORD         448      1.83       -62  8857
## # ℹ 214 more rows

2.4.2 Applying functions to groups adding to the original table

The mutate function can be used to add new columns to the original table but calculated for each group individually. For example,

Code

df %>%
    group_by(b) %>%
    mutate(mean_a = mean(a)) %>%
    ungroup()

## # A tibble: 4 × 3
##       a     b mean_a
##   <dbl> <dbl>  <dbl>
## 1     1     4      1
## 2     2    NA      2
## 3     3     6      3
## 4    NA     8     NA

Step 1: Group the data using the column b

Step 2: Calculate the mean of column a for each group. The mutate function applies in this case only to the group and not the full table.

Step 3: To restore the full table, use ungroup. This removes the grouping.

2.4.3 Shortcuts

Some Split - Apply - Combine steps occur so frequently that dplyr provides shortcuts for them.

Code

flights %>%
    count(origin)

## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

is equivalent to

Code

flights %>%
    group_by(origin) %>%
    summarize(n=n())

## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

2.5 Concatenate and join data

2.5.1 Concatenating data

Concatenating data is the process of combining two or more datasets into a single one. The dplyr package provides the bind_rows and bind_cols functions for this purpose. The bind_rows function is used to combine rows from two datasets. The bind_cols function is used to combine columns from two datasets.

The dplyr::bind_rows function combines rows from two or more datasets. This is useful if your dataset is split into parts and you need to combine them.

Code

df1 <- tibble(a=c(1, 2, 3), b=c(4, 5, 6))
df2 <- tibble(a=c(4, 5, 6), b=c(7, 8, 9), c=c(10, 11, 12))

# bind rows
bind_rows(df1, df2)

## # A tibble: 6 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     4    NA
## 2     2     5    NA
## 3     3     6    NA
## 4     4     7    10
## 5     5     8    11
## 6     6     9    12

Columns are matched by name. If a data frame has missing columns, the missing columns are filled with NA.

The dplyr::bind_cols function combines columns from two or more datasets. This is useful if you want to add columns to an existing dataset.

Code

df3 <- tibble(c = c(4, 5, 6), d = c(7, 8, 9))
bind_cols(df1, df3)

## # A tibble: 3 × 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     4     7
## 2     2     5     5     8
## 3     3     6     6     9

The rows are combined in the order they appear, so be careful that the rows match. If you have a common identifier in the data frames, using and of the join functions is a safer option.

If columns occur in both dataframes, the column names will be adjusted to be unique. In the following example, the a columns are renamed to a...1 and a...3.

Code

df4 <- tibble(a = c(4, 5, 6), c = c(7, 8, 9))
bind_cols(df1, df4)

## New names:
## • `a` -> `a...1`
## • `a` -> `a...3`

## # A tibble: 3 × 4
##   a...1     b a...3     c
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     4     7
## 2     2     5     5     8
## 3     3     6     6     9

2.5.2 Joins

Joins are used to combine or merge two datasets, based on common identfiiers. The dplyr package provides a number of functions for this purpose. The main functions are inner_join, left_join, right_join, and full_join. The inner_join function is the most commonly used one. It keeps only rows that have matching values in both datasets. The left_join function keeps all rows from the first dataset and adds columns from the second dataset if there is a match. The right_join function is the reverse of left_join. The full_join function keeps all rows from both datasets. We will not cover this here but you can read more about joins in the R4DS book.

2.6 Additional tidyr functions

The tidyr package has a few more functions that are useful for data manipulation and cleanup. We list these here for completeness.

pivot_wider()/spread(): Spreads a pair of key:value columns into a set of tidy columns
pivot_longer()/gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use pivot_longer()/gather() when you notice that you have columns that are not variables
separate() turns a single character column into multiple columns
unite() paste together multiple columns into one (reverse of separate())

Further information:

The dplyr cheatsheet is a two-page summary of all the main features of dplyr. For more details about dplyr, see the main website at https://dplyr.tidyverse.org/.

Code

The code of this chapter is summarized here.

Code

knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)    # or library(dplyr)
library(nycflights13) # load flight data
dim(flights)
df <- flights %>%
    # step 1
    filter(distance < 1000) %>%
    # step 2
    select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
    # step 3
    mutate(
        Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
                      sd(dep_delay, na.rm=TRUE),
        dep_delay = dep_delay / 60,
        arr_delay = arr_delay / 60,
        speed = distance / (air_time / 60)
    ) %>%
    # step 4
    arrange(-speed) %>%
    # step 5
    print(n=4)
df <- tibble(a = c(1, 3, 2, 1, 3, 2), b = c(4, 4, 5, 5, 6, 6))
df %>% arrange(a)
df %>% arrange(b, a)
df %>% arrange(desc(b), a)
df %>% arrange(-a, b)
flights %>%
    colnames()
# select by part of name:
flights %>%
    select(starts_with("arr")) %>%
    colnames()
flights %>%
    select(contains("time")) %>%
    colnames()
flights %>%
    select(ends_with("delay")) %>%
    colnames()
# using regular expressions
flights %>%
    select(matches("(time|hour|minute)")) %>%
    colnames()
# select columns using a vector of column names (all_of or any_of)
variables <- c("dep_delay", "arr_delay", "origin", "dest",
               "air_time", "distance")
flights %>%
    select(all_of(variables)) %>%
    colnames()
# remove columns by name:
flights %>%
    select(-c(carrier, flight, tailnum)) %>%
    colnames()
flights %>%
    select(! matches("(time|hour|minute)")) %>%
    colnames()
df <- tibble(
    row = c(1:7),
    a = c(1, 2, 3, 1, 3, 1, NA),
    b = c(4, 5, 6, 4, 5, 6, 8)
)
# filter by column values
df %>%
    filter(a == 1)
df %>%
    filter(a == 1 & b == 4)
df %>%
    filter(a > 2 | b == 4)
df %>%
    filter(a %in% c(1, 3))
# remove by volumn values - use the ! operator
df %>%
    filter(! a %in% c(1, 3))
# filter to missing values in column
df %>%
    filter(is.na(a))
df %>%
    distinct(a, b, .keep_all = TRUE)
df %>%
    slice(1:3)
df %>%
    slice(1, 3, 5)
idx <- c(2, 4, 6)
df %>%
    slice(idx)
df %>%
    slice_min(a)
df %>%
    slice_max(a, n=2)
df %>%
    slice_sample(n=5)  # sample without replacement
df %>%
    slice_sample(n=5, replace=TRUE)  # sample with replacement
df <- tibble(
    a = c(1, 2, 3),
    b = c(4, 5, 6)
)
# Add one or more new columns
df %>%
    mutate(c = a + b)
df %>%
    mutate(c = a + b, d = a - b, e = c * d)
# Modify existing columns
df %>%
    mutate(a = a + 1)
df %>%
    add_column(idx = seq_len(nrow(df)))
df %>%
    add_column(idx = seq_len(nrow(df)), .before="a")
df %>%
    add_row(a=4, b=7)
df %>%
    add_row(a=4, b=7, .after=1)
df <- tibble(
    a = c(1, 2, 3, -9999),
    b = c(4, -9999, 6, 8)
)
df <- df %>%
    mutate_all(na_if, -9999) %>%
    print()
df <- tibble(
    a = c(1, 2, 3, NA),
    b = c(4, NA, 6, 8)
)
df %>%
    drop_na()  # drop rows with missing values
df %>%
    drop_na(a)  # drop rows in which column "a" has missing values
df %>%
    replace_na(list(a = 0, b = 99))
df %>%
    replace_na(as.list(colMeans(df, na.rm=TRUE)))
flights %>%
    group_by(carrier) %>%
    summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE)) %>%
    arrange(avg_dep_delay)
flights %>%
    group_by(origin, dest) %>%   # group by both origin and dest
    summarise(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n(),       # n() gives the group count
              .groups = "keep")  # keep information about the grouping
flights %>%
    summarize(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n(),
              .by = c(origin, dest))
df %>%
    group_by(b) %>%
    mutate(mean_a = mean(a)) %>%
    ungroup()
flights %>%
    count(origin)
flights %>%
    group_by(origin) %>%
    summarize(n=n())
df1 <- tibble(a=c(1, 2, 3), b=c(4, 5, 6))
df2 <- tibble(a=c(4, 5, 6), b=c(7, 8, 9), c=c(10, 11, 12))

# bind rows
bind_rows(df1, df2)
df3 <- tibble(c = c(4, 5, 6), d = c(7, 8, 9))
bind_cols(df1, df3)
df4 <- tibble(a = c(4, 5, 6), c = c(7, 8, 9))
bind_cols(df1, df4)

References

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (April): 1–29. https://doi.org/10.18637/jss.v040.i01.

Chapter 2 Manipulating data

2.1 Example: manipulating flights data

2.2 Overview of dplyr functionality

2.2.1 Sorting data

2.2.2 Selecting columns

2.2.3 Filtering rows

2.2.4 Modifying tables

2.3 Handling missing values

2.3.1 Convert placeholders to missing values

2.3.2 Remove or replace missing values

2.4 Split - Apply - Combine

2.4.1 Two examples

2.4.2 Applying functions to groups adding to the original table

2.4.3 Shortcuts

2.5 Concatenate and join data

2.5.1 Concatenating data

2.5.2 Joins

2.6 Additional tidyr functions

Code

References

2.2 Overview of `dplyr` functionality