Chapter 1 Data loading and cleaning

The objective of exploratory data anlysis is to understand your data, find patterns, identify outliers, and possibly form hypotheses for further analysis. In this chapter, we will use the penguins dataset to learn how to load data, inspect it, and process it by removing missing values and outliers. The following Chapter 2 will cover more details on data transformation. Chapter 3 will cover data visualization.

Let’s start with loading the tidyverse packages:

Code
library(tidyverse)

1.1 Load data

We use the function readr::read_csv to load the data. This function and its variations like readr::read_delim are able to read a wide variety of formats. Here, we read a compressed file in comma separated (csv) format.

Code
penguins <- readr::read_csv("data/penguins_modified.csv.gz")
## Rows: 347 Columns: 7
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (4): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We learn that the data set contains 347 observations with 7 columns.1 Three of the columns are string (chr: species, island, sex), the rest are numeric (dbl: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g).

Note that the command read_csv is spelt similarly to the base-R function read.csv. To avoid confusion, we use the notation readr::read_csv to indicate that we use the function read_csv from the package readr.

Todo:

Read the help page for readr::read_csv and find out what the arguments file, delim, comment, and skip do.

1.2 Inspect data to identify missing values

The data is now stored in the variable penguins as a tibble. A tibble is the tidyverse form of a data frame. It is a bit more strict than a data frame, but also more consistent. Let’s have a look at the first few rows of the data:

Code
penguins
## # A tibble: 347 × 7
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 337 more rows
## # ℹ 1 more variable: sex <chr>

Compared to standard data frames, this is a more informative representation of your data. It shows the first few rows, the column names, and the data types of the columns. The data types are important, because they determine what you can do with the data.

A useful first step is to check for missing data. We can already see in the output above, that the dataset contains missing values indicated by NA. There are various ways of doing this. One way is to use the function is.na to check for missing values in each column:

Code
colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex 
##                 2                 2                11

Several columns contain missing values. Here are the rows with missing values:

Code
penguins[rowSums(is.na(penguins)) > 0, ]
## # A tibble: 11 × 7
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # ℹ 1 more variable: sex <chr>

With just a few missing values in the dataset, we can simply remove them. This is done with the function drop_na:

Code
df <- penguins %>% 
    drop_na()

The pipe operator %>% (pronounced as then) is used to chain commands together. The command penguins %>% drop_na() takes the data frame penguins and passes it to the function drop_na. The statement is equivalent to:

df <- drop_na(penguins)

The real power of the pipe operator becomes apparent when we combined multiple commands in a processing pipeline.

1.3 Identifying outliers

The tidyverse comes with a powerful plotting package called ggplot2. We will use the ggplot2 package throughout this class. To get started, we look at the values of the numerical columns to get a feel for the data. Here is a graph (Figure 1.1 of the values of the bill_length_mm column:

Code
df <- penguins %>% 
    drop_na() %>%
    mutate(id = row_number())
ggplot(df, aes(x=id, y=bill_length_mm)) + geom_line()
Values of the `bill_length_mm` column.

Figure 1.1: Values of the bill_length_mm column.

We first create a temporary tibble df that contains the id column and the bill_length_mm column. The id column is just a sequence of numbers from 1 to 347. We use this column to plot the values of bill_length_mm against the row number. The geom_line function adds a line plot to the plot. The result is a line plot of the values of bill_length_mm against the row number. We can see that the values are mostly between 30 and 60, but there are some outliers. The values are:

Code
penguins %>%
    mutate(id = row_number(), .before=species) %>%
    slice(19:25) %>%
    knitr::kable()
id species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
19 Adelie Torgersen 34.4 18.4 184 3325 female
20 Adelie Torgersen 46.0 21.5 194 4200 male
21 Adelie Torgersen 391.0 18.7 181 3750 male
22 Adelie Torgersen 395.0 17.4 186 3800 female
23 Adelie Torgersen 403.0 18.0 195 3250 female
24 Adelie Biscoe 37.8 18.3 174 3400 female
25 Adelie Biscoe 37.7 18.7 180 3600 male

We can see that rows 21, 22, and 23 have values that are around 10 times higher than all other values. It’s likely that these values are due to an error in the data collection process. We can remove these values with the following command:

Code
df <- penguins %>% 
    drop_na() %>%
    filter(bill_length_mm < 100)

Todo:

Check if any of the other numerical columns contain outliers.

1.4 Summary

In this chapter, we have learned how to load data, inspect it, and process it. We have also learned how to remove missing values and outliers.

The analysis of this chapter results in the following processing pipeline:

Code
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>%
    drop_na() %>%
    filter(bill_length_mm < 100)
## Rows: 347 Columns: 7
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (4): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Further information:

The data import cheatsheet is a short summary of all the main features of readr and readxl. For more details see https://readr.tidyverse.org/. The readxl package allows to import Excel files, see https://readxl.tidyverse.org/.

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)
penguins <- readr::read_csv("data/penguins_modified.csv.gz")
penguins
colSums(is.na(penguins))
penguins[rowSums(is.na(penguins)) > 0, ]
df <- penguins %>% 
    drop_na()
df <- penguins %>% 
    drop_na() %>%
    mutate(id = row_number())
ggplot(df, aes(x=id, y=bill_length_mm)) + geom_line()
penguins %>%
    mutate(id = row_number(), .before=species) %>%
    slice(19:25) %>%
    knitr::kable()
df <- penguins %>% 
    drop_na() %>%
    filter(bill_length_mm < 100)
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>%
    drop_na() %>%
    filter(bill_length_mm < 100)

  1. Note: we use a modified version of the original dataset here↩︎