Chapter 2 Manipulating data

The dplyr package is the main component of the tidyverse for data maniuplation. You can load it either by loading the tidyverse package or by loading dplyr directly.

Code
library(tidyverse)    # or library(dplyr) 

2.1 Example: manipulating flights data

We’ve already seen some of the basic functions in Chapter 1. Let’s look at a more elaborate example using a dataset from the nycflights13 package. This package contains information about all flights that departed from NYC (e.g., EWR, JFK and LGA) in 2013. The dataset is quite large and includes information on delays, airlines, airports, weather, and planes. It contains 336,776 rows and 19 columns.

We first load the nycflights13 package to get the flights dataset.

Code
library(nycflights13) # load flight data
dim(flights)
## [1] 336776     19

For our analysis, we want to focus on flights that were less than 1000 miles (distance) and restrict the dataset to keep only the columns: dep_delay, arr_delay, origin, dest, air_time, and distance. In addition, we convert the departure and arrival delays into hours, and calculate the average flight speed (in mph). We also want to add a new column with the Z-score for departure delays. Finally, we want to order by average flight speed (fastest to slowest) and return the first four rows.

This is the full data processing pipeline:

Code
df <- flights %>% 
    # step 1
    filter(distance < 1000) %>%   
    # step 2
    select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
    # step 3
    mutate(
        Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
                      sd(dep_delay, na.rm=TRUE), 
        dep_delay = dep_delay/60, 
        arr_delay=arr_delay/60,
        speed = distance/(air_time/60)
    ) %>%
    # step 4
    arrange(-speed) %>% 
    # step 5
    print(n=4)
## # A tibble: 189,671 × 8
##   dep_delay arr_delay origin dest  air_time distance Z_dep_delay speed
##       <dbl>     <dbl> <chr>  <chr>    <dbl>    <dbl>       <dbl> <dbl>
## 1    0.15     -0.233  LGA    ATL         65      762     -0.108   703.
## 2    0.25     -0.0167 EWR    GSP         55      594      0.0355  648 
## 3    0.0667    0.0333 EWR    BNA         70      748     -0.228   641.
## 4    0.267    -0.367  EWR    CVG         62      569      0.0595  551.
## # ℹ 189,667 more rows

Step 1: The filter function affects the rows of the dataframe. Here, it keeps only flights with a distance less than 1000 miles. The dplyr package allows us to use the column names without quotes and without the table name.

Step 2: select affects the columns of the dataframe. Here, it keeps the listed six columns. The select statement accepts a number of helper functions to select columns based on their names. For example, starts_with("delay") selects all columns that start with the word “delay”. The : operator is used to select a range of columns. Using this functionality, we could have written the select statement also as:

    select(ends_with("delay"), origin:distance) %>%

Step 3: This steps modifies columns and adds new ones. The first assignment, creates the new column Z_dep_delay which is the Z-score of the departure delay. The second and third assignments convert the departure and arrival delays from minutes to hours. The fourth assignment calculates the average flight speed in miles per hour.

Step 4: The arrange function is used to order the rows by speed. The - sign is used to indicate descending order.

Step 5: Finally, we print the first 4 rows. This serves only for information and doesn’t change the tibble which is assigned to the variable df.

2.2 Overview of dplyr functionality

We’ve seen some of the basic functions in the example above. In most cases, these will be sufficient for your data manipulation needs. It is however useful to have an understanding of the whole package.

Todo:

Visit the dplyr website (https://dplyr.tidyverse.org/) and read the Get started section. This will give you a good overview of the package.

The functionality can be organized into several categories.

2.2.1 Sorting data

The function arrange() is used to sort the data. Here are a few examples:

Code
df <- tibble(a = c(1, 3, 2, 1, 3, 2), b = c(4, 4, 5, 5, 6, 6))
df %>% arrange(a)
## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     1     5
## 3     2     5
## 4     2     6
## 5     3     4
## 6     3     6
Code
df %>% arrange(b, a)
## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     3     4
## 3     1     5
## 4     2     5
## 5     2     6
## 6     3     6
Code
df %>% arrange(desc(b), a)
## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     2     6
## 2     3     6
## 3     1     5
## 4     2     5
## 5     1     4
## 6     3     4
Code
df %>% arrange(-a, b)
## # A tibble: 6 × 2
##       a     b
##   <dbl> <dbl>
## 1     3     4
## 2     3     6
## 3     2     5
## 4     2     6
## 5     1     4
## 6     1     5

The examples show that we can sort by one or more columns. The default is ascending order. To sort in descending order, we can use the desc() function or the - sign. Multiple columns are sorted in the order they are listed. For example, -a, b first sorts by column a in descending order, then for rows with the same value in column a, it sorts by column b in ascending order.

2.2.2 Selecting columns

The function select() is used to select columns. It can be used to keep only certain columns or to drop columns. The select() function accepts a number of helper functions to select columns based on their names. Here are a few examples:

Code
flights %>% 
    colnames()
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"
Code
# select by part of name:
flights %>% 
    select(starts_with("arr")) %>% 
    colnames()
## [1] "arr_time"  "arr_delay"
Code
flights %>% 
    select(contains("time")) %>% 
    colnames()
## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "time_hour"
Code
flights %>% 
    select(ends_with("delay")) %>% 
    colnames()
## [1] "dep_delay" "arr_delay"
Code
# using regular expressions
flights %>% 
    select(matches("(time|hour|minute)")) %>% 
    colnames()
## [1] "dep_time"       "sched_dep_time" "arr_time"       "sched_arr_time"
## [5] "air_time"       "hour"           "minute"         "time_hour"
Code
# select columns using a vector of column names (all_of or any_of)
variables <- c("dep_delay", "arr_delay", "origin", "dest", 
               "air_time", "distance")
flights %>% 
    select(all_of(variables)) %>% 
    colnames()
## [1] "dep_delay" "arr_delay" "origin"    "dest"      "air_time"  "distance"
Code
# remove columns by name:
flights %>% 
    select(-c(carrier, flight, tailnum)) %>% 
    colnames()
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "origin"         "dest"           "air_time"      
## [13] "distance"       "hour"           "minute"         "time_hour"
Code
flights %>% 
    select(! matches("(time|hour|minute)")) %>% 
    colnames()
##  [1] "year"      "month"     "day"       "dep_delay" "arr_delay" "carrier"  
##  [7] "flight"    "tailnum"   "origin"    "dest"      "distance"

2.2.3 Filtering rows

The function filter() is used to filter rows. It keeps only rows that satisfy the condition. Here are a few examples:

Code
df <- tibble(
    row = c(1:7), 
    a = c(1, 2, 3, 1, 3, 1, NA), 
    b = c(4, 5, 6, 4, 5, 6, 8)
)
# filter by column values 
df %>% 
    filter(a == 1)
## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4
## 3     6     1     6
Code
df %>% 
    filter(a == 1 & b == 4)
## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4
Code
df %>% 
    filter(a > 2 | b == 4)
## # A tibble: 4 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     4     1     4
## 4     5     3     5
Code
df %>% 
    filter(a %in% c(1, 3))
## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     4     1     4
## 4     5     3     5
## 5     6     1     6
Code
# remove by volumn values - use the ! operator
df %>% 
    filter(! a %in% c(1, 3))
## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     2     2     5
## 2     7    NA     8
Code
# filter to missing values in column
df %>% 
    filter(is.na(a))
## # A tibble: 1 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     7    NA     8

distinct removes duplicate rows:

Code
df %>% 
    distinct(a, b, .keep_all = TRUE)
## # A tibble: 6 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6
## 4     5     3     5
## 5     6     1     6
## 6     7    NA     8

It’s also possible to select rows using the row number. The function slice() is used for this purpose.

Code
df %>% 
    slice(1:3)
## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6
Code
df %>% 
    slice(1, 3, 5)
## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     3     3     6
## 3     5     3     5
Code
idx <- c(2, 4, 6)
df %>% 
    slice(idx)
## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     2     2     5
## 2     4     1     4
## 3     6     1     6

slice_min and slice_max are useful if you want to identify the rows with the smallest or largest values in a column.

Code
df %>% 
    slice_min(a)
## # A tibble: 3 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     4     1     4
## 3     6     1     6
Code
df %>% 
    slice_max(a, n=2)
## # A tibble: 2 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     3     3     6
## 2     5     3     5

slice_sample selects a random sample of rows.

Code
df %>% 
    slice_sample(n=5)  # sample without replacement
## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     5     3     5
## 2     2     2     5
## 3     6     1     6
## 4     4     1     4
## 5     7    NA     8
Code
df %>% 
    slice_sample(n=5, replace=TRUE)  # sample with replacement
## # A tibble: 5 × 3
##     row     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     1     1     4
## 3     7    NA     8
## 4     3     3     6
## 5     1     1     4

2.2.4 Modifying tables

We’ve already seen the mutate function that is used to add new columns or modify existing ones.

Code
df <- tibble(
    a = c(1, 2, 3), 
    b = c(4, 5, 6)
)
# Add one or more new columns
df %>% 
    mutate(c = a + b)
## # A tibble: 3 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     4     5
## 2     2     5     7
## 3     3     6     9
Code
df %>% 
    mutate(c = a + b, d = a - b, e = c * d)
## # A tibble: 3 × 5
##       a     b     c     d     e
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -3   -15
## 2     2     5     7    -3   -21
## 3     3     6     9    -3   -27

The last example shows that the new columns (e) can be based on previously created columns (c and d).

Code
# Modify existing columns
df %>% 
    mutate(a = a + 1)
## # A tibble: 3 × 2
##       a     b
##   <dbl> <dbl>
## 1     2     4
## 2     3     5
## 3     4     6

If a column name already exists, the original column will be replaced. To avoid this, use the add_column function. It will throw an error if the column name exists. The add_column function can also be used to add a column in a specific position.

Code
df %>% 
    add_column(idx = 1:nrow(df))
## # A tibble: 3 × 3
##       a     b   idx
##   <dbl> <dbl> <int>
## 1     1     4     1
## 2     2     5     2
## 3     3     6     3
Code
df %>% 
    add_column(idx = 1:nrow(df), .before="a")
## # A tibble: 3 × 3
##     idx     a     b
##   <int> <dbl> <dbl>
## 1     1     1     4
## 2     2     2     5
## 3     3     3     6

add_row adds new rows to the table. We can again use .before and .after to specify the position.

Code
df %>% 
    add_row(a=4, b=7)
## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2     5
## 3     3     6
## 4     4     7
Code
df %>% 
    add_row(a=4, b=7, .after=1)
## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     4     7
## 3     2     5
## 4     3     6

2.3 Handling missing values

Real life data are often messy and contain missing values. The tidyverse packages dplyr and tidyr have functions that help dealing with this. In datasets, missing values are not always identified using NA. Sometimes, they are identified using a placeholder value, e.g. -9999.

2.3.1 Convert placeholders to missing values

Here is an example of a dataset with placeholder values that get converted to missing values.

Code
df <- tibble(
    a = c(1, 2, 3, -9999), 
    b = c(4, -9999, 6, 8)
)
df <- df %>% 
    mutate_all(na_if, -9999) %>% 
    print()
## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    NA
## 3     3     6
## 4    NA     8

2.3.2 Remove or replace missing values

The tidyr package has quite a few functions for handling missing values. We already encountered the drop_na function to remove rows with missing values.

Code
df <- tibble(
    a = c(1, 2, 3, NA), 
    b = c(4, NA, 6, 8)
)
df %>% 
    drop_na()  # drop rows with missing values
## # A tibble: 2 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     3     6
Code
df %>% 
    drop_na(a)  # drop rows in which column "a" has missing values
## # A tibble: 3 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    NA
## 3     3     6

The replace_na function replaces missing values with a specified value.

Code
df %>% 
    replace_na(list(a = 0, b = 99))
## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2    99
## 3     3     6
## 4     0     8

To replace the missing values with the mean of each column, we can use:

Code
df %>% 
    replace_na(as.list(colMeans(df, na.rm=TRUE)))
## # A tibble: 4 × 2
##       a     b
##   <dbl> <dbl>
## 1     1     4
## 2     2     6
## 3     3     6
## 4     2     8

2.4 Split - Apply - Combine

The dplyr operations are even more powerful when they can be used with grouping variables. This approach to data processing is also known as Split - Apply - Combine (Wickham 2011).

  • Split: The data is split into groups based on one or more grouping variables.
  • Apply: A function is applied to each group independently.
  • Combine: The results are combined into a new data structure.

2.4.1 Two examples

Let’s see this approach in action using the flights dataset. We want to calculate the average departure delay for each airline and sort the resulting table by the average delay.

Code
flights %>% 
    group_by(carrier) %>% 
    summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE)) %>%
    arrange(avg_dep_delay) 
## # A tibble: 16 × 2
##    carrier avg_dep_delay
##    <chr>           <dbl>
##  1 US               3.78
##  2 HA               4.90
##  3 AS               5.80
##  4 AA               8.59
##  5 DL               9.26
##  6 MQ              10.6 
##  7 UA              12.1 
##  8 OO              12.6 
##  9 VX              12.9 
## 10 B6              13.0 
## 11 9E              16.7 
## 12 WN              17.7 
## 13 FL              18.7 
## 14 YV              19.0 
## 15 EV              20.0 
## 16 F9              20.2

Step 1: The group_by function is used to specify the grouping variable(s). In this case, we group by the carrier variable. Note: grouping should to be applied on discrete variables (categorical, factor, or maybe integer valued columns).

Step 2: This step is applied to the individual groups. The summarize function is used to calculate the average departure delay for each group. The na.rm=TRUE argument is used to ignore missing values. Following the summarization, the data is combined into a single tibble. It contains only the grouping variable(s) and the new summary variable(s).

Step 3: We can now continue with normal processing. The arrange function is used to sort the resulting table by the average delay.

We can group by multiple variables. For example,

Code
flights %>% 
    group_by(origin, dest) %>%   # group by both origin and dest
    summarize(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n() )        # n() gives the group count
## Warning: There were 2 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `max.delay = max(arr_delay, na.rm = TRUE)`.
## ℹ In group 41: `origin = "EWR"` and `dest = "LGA"`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
## # A tibble: 224 × 6
## # Groups:   origin [3]
##    origin dest  max.delay avg.delay min.delay count
##    <chr>  <chr>     <dbl>     <dbl>     <dbl> <int>
##  1 EWR    ALB         328    14.4         -34   439
##  2 EWR    ANC          39    -2.5         -47     8
##  3 EWR    ATL         796    13.2         -39  5022
##  4 EWR    AUS         349    -0.474       -59   968
##  5 EWR    AVL         228     8.80        -26   265
##  6 EWR    BDL         266     7.05        -43   443
##  7 EWR    BNA         364    12.7         -41  2336
##  8 EWR    BOS         422     4.78        -47  5327
##  9 EWR    BQN         208    10.9         -43   297
## 10 EWR    BTV         306    12.2         -41   931
## # ℹ 214 more rows

2.4.2 Applying functions to groups adding to the original table

The mutate function can be used to add new columns to the original table but calculated for each group individually. For example,

Code
df %>% 
    group_by(b) %>% 
    mutate(mean_a = mean(a)) %>% 
    ungroup()
## # A tibble: 4 × 3
##       a     b mean_a
##   <dbl> <dbl>  <dbl>
## 1     1     4      1
## 2     2    NA      2
## 3     3     6      3
## 4    NA     8     NA

Step 1: Group the data using the column b

Step 2: Calculate the mean of column a for each group. The mutate function applies in this case only to the group and not the full table.

Step 3: To restore the full table, use ungroup. This removes the grouping.

2.4.3 Shortcuts

Some Split - Apply - Combine steps occur so frequently that dplyr provides shortcuts for them.

Code
flights %>% 
    count(origin)
## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

is equivalent to

Code
flights %>% 
    group_by(origin) %>% 
    summarize(n=n())
## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

2.5 Concatenate and join data

2.5.1 Concatenating data

Concatenating data is the process of combining two or more datasets into a single one. The dplyr package provides the bind_rows and bind_cols functions for this purpose. The bind_rows function is used to combine rows from two datasets. The bind_cols function is used to combine columns from two datasets.

The dplyr::bind_rows function combines rows from two or more datasets. This is useful if your dataset is split into parts and you need to combine them.

Code
df1 <- tibble(a=c(1, 2, 3), b=c(4, 5, 6))
df2 <- tibble(a=c(4, 5, 6), b=c(7, 8, 9), c=c(10, 11, 12))

# bind rows
bind_rows(df1, df2)
## # A tibble: 6 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     4    NA
## 2     2     5    NA
## 3     3     6    NA
## 4     4     7    10
## 5     5     8    11
## 6     6     9    12

Columns are matched by name. If a data frame has missing columns, the missing columns are filled with NA.

The dplyr::bind_cols function combines columns from two or more datasets. This is useful if you want to add columns to an existing dataset.

Code
df3 <- tibble(c = c(4, 5, 6), d = c(7, 8, 9))
bind_cols(df1, df3)
## # A tibble: 3 × 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     4     7
## 2     2     5     5     8
## 3     3     6     6     9

The rows are combined in the order they appear, so be careful that the rows match. If you have a common identifier in the data frames, using and of the join functions is a safer option.

If columns occur in both dataframes, the column names will be adjusted to be unique. In the following example, the a columns are renamed to a...1 and a...3.

Code
df4 <- tibble(a = c(4, 5, 6), c = c(7, 8, 9))
bind_cols(df1, df4)
## New names:
## • `a` -> `a...1`
## • `a` -> `a...3`
## # A tibble: 3 × 4
##   a...1     b a...3     c
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     4     7
## 2     2     5     5     8
## 3     3     6     6     9

2.5.2 Joins

Joins are used to combine or merge two datasets, based on common identfiiers. The dplyr package provides a number of functions for this purpose. The main functions are inner_join, left_join, right_join, and full_join. The inner_join function is the most commonly used one. It keeps only rows that have matching values in both datasets. The left_join function keeps all rows from the first dataset and adds columns from the second dataset if there is a match. The right_join function is the reverse of left_join. The full_join function keeps all rows from both datasets. We will not cover this here but you can read more about joins in the R4DS book.

2.6 Additional tidyr functions

The tidyr package has a few more functions that are useful for data manipulation and cleanup. We list these here for completeness.

  • pivot_wider()/spread(): Spreads a pair of key:value columns into a set of tidy columns
  • pivot_longer()/gather(): Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use pivot_longer()/gather() when you notice that you have columns that are not variables
  • separate() turns a single character column into multiple columns
  • unite() paste together multiple columns into one (reverse of separate())

Further information:

The dplyr cheatsheet is a two-page summary of all the main features of dplyr. For more details about dplyr, see the main website at https://dplyr.tidyverse.org/.

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)    # or library(dplyr) 
library(nycflights13) # load flight data
dim(flights)
df <- flights %>% 
    # step 1
    filter(distance < 1000) %>%   
    # step 2
    select(dep_delay, arr_delay, origin, dest, air_time, distance) %>%
    # step 3
    mutate(
        Z_dep_delay = (dep_delay - mean(dep_delay, na.rm=TRUE)) /
                      sd(dep_delay, na.rm=TRUE), 
        dep_delay = dep_delay/60, 
        arr_delay=arr_delay/60,
        speed = distance/(air_time/60)
    ) %>%
    # step 4
    arrange(-speed) %>% 
    # step 5
    print(n=4)
df <- tibble(a = c(1, 3, 2, 1, 3, 2), b = c(4, 4, 5, 5, 6, 6))
df %>% arrange(a)
df %>% arrange(b, a)
df %>% arrange(desc(b), a)
df %>% arrange(-a, b)
flights %>% 
    colnames()
# select by part of name:
flights %>% 
    select(starts_with("arr")) %>% 
    colnames()
flights %>% 
    select(contains("time")) %>% 
    colnames()
flights %>% 
    select(ends_with("delay")) %>% 
    colnames()
# using regular expressions
flights %>% 
    select(matches("(time|hour|minute)")) %>% 
    colnames()
# select columns using a vector of column names (all_of or any_of)
variables <- c("dep_delay", "arr_delay", "origin", "dest", 
               "air_time", "distance")
flights %>% 
    select(all_of(variables)) %>% 
    colnames()
# remove columns by name:
flights %>% 
    select(-c(carrier, flight, tailnum)) %>% 
    colnames()
flights %>% 
    select(! matches("(time|hour|minute)")) %>% 
    colnames()
df <- tibble(
    row = c(1:7), 
    a = c(1, 2, 3, 1, 3, 1, NA), 
    b = c(4, 5, 6, 4, 5, 6, 8)
)
# filter by column values 
df %>% 
    filter(a == 1)
df %>% 
    filter(a == 1 & b == 4)
df %>% 
    filter(a > 2 | b == 4)
df %>% 
    filter(a %in% c(1, 3))
# remove by volumn values - use the ! operator
df %>% 
    filter(! a %in% c(1, 3))
# filter to missing values in column
df %>% 
    filter(is.na(a))
df %>% 
    distinct(a, b, .keep_all = TRUE)
df %>% 
    slice(1:3)
df %>% 
    slice(1, 3, 5)
idx <- c(2, 4, 6)
df %>% 
    slice(idx)
df %>% 
    slice_min(a)
df %>% 
    slice_max(a, n=2)
df %>% 
    slice_sample(n=5)  # sample without replacement
df %>% 
    slice_sample(n=5, replace=TRUE)  # sample with replacement
df <- tibble(
    a = c(1, 2, 3), 
    b = c(4, 5, 6)
)
# Add one or more new columns
df %>% 
    mutate(c = a + b)
df %>% 
    mutate(c = a + b, d = a - b, e = c * d)
# Modify existing columns
df %>% 
    mutate(a = a + 1)
df %>% 
    add_column(idx = 1:nrow(df))
df %>% 
    add_column(idx = 1:nrow(df), .before="a")
df %>% 
    add_row(a=4, b=7)
df %>% 
    add_row(a=4, b=7, .after=1)
df <- tibble(
    a = c(1, 2, 3, -9999), 
    b = c(4, -9999, 6, 8)
)
df <- df %>% 
    mutate_all(na_if, -9999) %>% 
    print()
df <- tibble(
    a = c(1, 2, 3, NA), 
    b = c(4, NA, 6, 8)
)
df %>% 
    drop_na()  # drop rows with missing values
df %>% 
    drop_na(a)  # drop rows in which column "a" has missing values
df %>% 
    replace_na(list(a = 0, b = 99))
df %>% 
    replace_na(as.list(colMeans(df, na.rm=TRUE)))
flights %>% 
    group_by(carrier) %>% 
    summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE)) %>%
    arrange(avg_dep_delay) 
flights %>% 
    group_by(origin, dest) %>%   # group by both origin and dest
    summarize(max.delay = max(arr_delay, na.rm=TRUE),
              avg.delay = mean(arr_delay, na.rm=TRUE),
              min.delay = min(arr_delay, na.rm=TRUE),
              count = n() )        # n() gives the group count
df %>% 
    group_by(b) %>% 
    mutate(mean_a = mean(a)) %>% 
    ungroup()
flights %>% 
    count(origin)
flights %>% 
    group_by(origin) %>% 
    summarize(n=n())
df1 <- tibble(a=c(1, 2, 3), b=c(4, 5, 6))
df2 <- tibble(a=c(4, 5, 6), b=c(7, 8, 9), c=c(10, 11, 12))

# bind rows
bind_rows(df1, df2)
df3 <- tibble(c = c(4, 5, 6), d = c(7, 8, 9))
bind_cols(df1, df3)
df4 <- tibble(a = c(4, 5, 6), c = c(7, 8, 9))
bind_cols(df1, df4)

References

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (April): 1–29. https://doi.org/10.18637/jss.v040.i01.