Chapter 7 Data preprocessing

In the previous chapters, we learned how to use dplyr to preprocess data. While this is useful for data exploration and cleanup, it is not enough for building models. For example, you may need to normalize predictors prior to the actual training step. The normalization transformation depends on the distribution of the training data, and the exact same transformation needs to be applied to new data. Because of this, it is important to include preprocessing steps in the modeling pipeline.

The preprocessing steps can be used to:

  • create new features,
  • transform the data to make it more suitable for the model,
  • introduce non-linearity into the model,
  • reduce the number of features, and
  • impute missing data.

Figure: Preprocessing using a recipe

The tidymodels framework makes this easy. The preprocessing steps are defined using the recipes package and combined with the model using a pipeline that is created using the workflows package. In this chapter, we will learn how to use the recipes package to preprocess data and build models.

Load required packages:

Code
library(tidyverse)
library(tidymodels)
library(patchwork)
library(kableExtra)

7.1 Preprocessing data with recipes

Let’s use the mtcars dataset as an example. The mtcars dataset contains 32 observations (rows) and 11 variables (columns); check ?mtcars for details on the dataset. The goal is to predict the fuel consumption (mpg) of a car based on the other variables. Here is the dataset:

Code
data <- datasets::mtcars %>% as_tibble(rownames="car")
data %>% 
    head() %>% 
    knitr::kable()
| car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-----|-----|-----|------|----|------|----|------|----|----|------|------|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |

Two of the variables are categorical: transmission type (am) and engine shape (vs). We can also see that the continuous variables have different scales. For example, the displacement (disp) is in the hundreds while the number of cylinders (cyl) is in the single digits. Our plan for the preprocessing steps in the modeling pipeline is to:

  • Convert the categorical variables to factors
  • Normalize the continuous variables

We will use the recipes package to define the preprocessing steps. The first step is to create a recipe object using the recipe() function. The first argument is a formula that specifies the outcome variable and the predictors. The second argument is the data frame that contains the data. The recipe() function returns a recipe object that contains the preprocessing steps. The summary() function can be used to display the recipe object:

Code
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)
summary(rec_obj)
## # A tibble: 11 × 4
##    variable type      role      source  
##    <chr>    <list>    <chr>     <chr>   
##  1 cyl      <chr [2]> predictor original
##  2 disp     <chr [2]> predictor original
##  3 hp       <chr [2]> predictor original
##  4 drat     <chr [2]> predictor original
##  5 wt       <chr [2]> predictor original
##  6 qsec     <chr [2]> predictor original
##  7 vs       <chr [2]> predictor original
##  8 am       <chr [2]> predictor original
##  9 gear     <chr [2]> predictor original
## 10 carb     <chr [2]> predictor original
## 11 mpg      <chr [2]> outcome   original

The output tells us which variables are included in the model and what their respective roles are. The role is just a label that is used to identify the variables. For our use case, the automatically assigned roles of predictor and outcome are fine.
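
Roles can be changed with update_role(). As a minimal sketch, suppose we wanted to keep the car name in the data without using it as a predictor; the role name "id" below is an arbitrary label of our choosing:

Code
recipe(mpg ~ ., data=data) %>%
    update_role(car, new_role="id") %>%   # keep the column, but not as a predictor
    summary()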

Now that we have a recipe, we can add preprocessing steps. The functions have the general format step_{X},

rec_obj <- step_{X}(rec_obj, ..., arguments)    ## or
rec_obj <- rec_obj %>% step_{X}(..., arguments)

The ... stands for a selection of variables. This can either be a list of variable names or a selector such as all_predictors() or all_numeric(); more about selectors later. The remaining arguments are keyword arguments and must be specified by name.

The function step_num2factor converts a numerical column to a factor column. The first argument is the recipe object. The second argument is the name of the variable to be converted. The levels argument is used to specify the levels of the factor. The transform argument is used to specify a function that is applied to the variable before it is converted to a factor.

Code
rec_obj <- rec_obj %>%
    step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
    step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual"))

The levels array is used to map the numbers in the vs column to strings. The values of vs are 0 and 1, so a simple lookup won’t work. We need to first transform each value before we can use it as an index into the levels array. This is done using the transform function. A value of 0 is changed to 1 by the transform function and then used as an index to look up the string “V-shaped” in the levels array. Similarly, a value of 1 is changed to 2 and then through lookup converted to “straight”. If your values already map to the correct indices, you can omit the transform argument. Finally, the whole column is changed to a factor. The second step does a similar transformation of the am column.

We can look at the result of the recipe so far using the prep() and bake() functions. The prep() function trains the steps using, in this case, the data. You can use the training argument to specify a different dataset. The bake() function applies the recipe to the data. The new_data argument is used to specify a different dataset. If the new_data argument is omitted, the recipe is applied to the data that was used to train the recipe.

Code
rec_obj %>%
    prep() %>%
    bake(new_data = NULL) %>%
    top_n(4)
## Selecting by mpg
## # A tibble: 4 × 11
##     cyl  disp    hp  drat    wt  qsec vs       am      gear  carb   mpg
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>    <fct>  <dbl> <dbl> <dbl>
## 1     4  78.7    66  4.08  2.2   19.5 straight manual     4     1  32.4
## 2     4  75.7    52  4.93  1.62  18.5 straight manual     4     2  30.4
## 3     4  71.1    65  4.22  1.84  19.9 straight manual     4     1  33.9
## 4     4  95.1   113  3.77  1.51  16.9 straight manual     5     2  30.4

Applying the recipe to the dataset results in a tibble where the columns vs and am are now factors.
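
The trained recipe can also be applied to data it has not seen before via the new_data argument. As a minimal sketch, we treat the first three rows as stand-in “new” data:

Code
new_cars <- data %>% slice(1:3)   # stand-in for genuinely new data
rec_obj %>%
    prep() %>%
    bake(new_data=new_cars) %>%
    select(vs, am)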

The next step is to normalize the continuous variables. The step_normalize() function is used to normalize the variables.

Code
rec_obj <- rec_obj %>%
    step_normalize(all_numeric_predictors())
rec_obj %>%
    prep() %>%
    bake(new_data = NULL) %>%
    top_n(4)
## Selecting by mpg
## # A tibble: 4 × 11
##     cyl  disp     hp  drat    wt   qsec vs       am      gear   carb   mpg
##   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <fct>    <fct>  <dbl>  <dbl> <dbl>
## 1 -1.22 -1.23 -1.18  0.904 -1.04  0.907 straight manual 0.424 -1.12   32.4
## 2 -1.22 -1.25 -1.38  2.49  -1.64  0.376 straight manual 0.424 -0.503  30.4
## 3 -1.22 -1.29 -1.19  1.17  -1.41  1.15  straight manual 0.424 -1.12   33.9
## 4 -1.22 -1.09 -0.491 0.324 -1.74 -0.531 straight manual 1.78  -0.503  30.4

Here, we use the all_numeric_predictors() selector to specify all the numeric variables that are labeled as predictor. During the prep step, the mean and standard deviation of the variables are computed and stored with the recipe. These values are used in the bake step to transform the data. As we can see, the continuous variables are now normalized. Had we used all_numeric() instead of all_numeric_predictors(), the outcome variable mpg would have been normalized as well.
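
The statistics stored during the prep step can be inspected with tidy(). Calling tidy() on the prepped recipe lists all steps; requesting step number 3 (the step_normalize step in our recipe) returns the stored means and standard deviations. A quick sketch:

Code
prepped <- prep(rec_obj)
tidy(prepped)              # overview of all steps in the recipe
tidy(prepped, number=3)    # means and sds learned by step_normalize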

To summarize, the recipe for preprocessing the data is:

Code
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)  %>%
    step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
    step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
    step_normalize(all_numeric_predictors())

Todo:

Now is a good time to look through the reference of the recipes package to get an overview of what is available.

In the following, we highlight some of the more commonly used functions.

7.2 Transformations of individual features

The following steps apply numerical transformations:

  • step_inverse: \(f(x) = 1/x\)
  • step_invlogit: \(f(x) = 1/(1+\exp(-x))\)
  • step_log: \(f(x) = \log(x)\)
  • step_logit: \(f(x) = \log(x/(1-x))\)
  • step_sqrt: \(f(x) = \sqrt{x}\)

The step_mutate() function can be used like the dplyr::mutate() function to compute arbitrary new columns.
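
For example, a ratio of two predictors can be added as a new column. A minimal sketch (the column name hp_per_wt is our own choice):

Code
recipe(mpg ~ hp + wt, data=data) %>%
    step_mutate(hp_per_wt = hp / wt) %>%   # power-to-weight ratio
    prep() %>%
    bake(new_data=NULL) %>%
    head(3)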

The Box-Cox transformation and the Yeo-Johnson transformation can be used to transform skewed data so that it is more normally distributed (see Wikipedia).

  • step_BoxCox: Box-Cox transformation for non-negative data
  • step_YeoJohnson: Yeo-Johnson transformation

All these steps transform a single column and replace the column with the transformed value. This is different for the step_poly function.

Code
xy <- tibble(
    x=seq(-1, 1, length.out=100), 
    y=seq(-1, 1, length.out=100)
)

transformed <- recipe(y~x, data=xy)  %>%
    step_poly(x, degree=3) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
| y | x_poly_1 | x_poly_2 | x_poly_3 |
|---|----------|----------|----------|
| -1.0000 | -0.1715 | 0.2170 | -0.2492 |
| -0.9798 | -0.1680 | 0.2038 | -0.2190 |
| -0.9596 | -0.1646 | 0.1910 | -0.1903 |
| -0.9394 | -0.1611 | 0.1783 | -0.1632 |
| -0.9192 | -0.1576 | 0.1660 | -0.1375 |
| -0.8990 | -0.1542 | 0.1539 | -0.1132 |

The result of applying the step_poly function is three new columns x_poly_1, x_poly_2, and x_poly_3. The columns contain the first three orthogonal polynomials of the x column. The degree argument specifies the degree of the polynomials, here 3. The role argument is used to specify the role of the new columns. The default is predictor.

Figure 7.1 shows the effect of applying the step_poly function to the x column. The first polynomial x_poly_1 is linear, x_poly_2 is a transformation with a quadratic function, and x_poly_3 is a cubic function. The polynomials are orthogonal, which means that they are uncorrelated. This is useful when using the polynomials as predictors in a regression model.
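
We can verify this numerically: the pairwise correlations between the polynomial columns are zero up to floating-point error. A quick check:

Code
transformed %>%
    select(starts_with("x_poly")) %>%
    cor() %>%
    round(digits=3)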

Code
ggplot(transformed, aes(x=y, y=x_poly_1)) +
    geom_line() +
    geom_line(aes(y=x_poly_2), color="red") +
    geom_line(aes(y=x_poly_3), color="darkgreen") +
    labs(x="x", y="x_poly_i")

Figure 7.1: Orthogonal polynomials created using step_poly

Finally, there are several functions to expand a column into a variety of spline basis functions.

  • step_ns: Natural spline basis functions
  • step_bs: B-spline basis functions
  • step_spline_b: Basis splines
  • step_spline_convex: Convex splines
  • step_spline_monotone: Monotone splines
  • step_spline_natural: Natural splines
  • step_spline_nonnegative: Non-negative splines

7.3 Discretizing numeric variables

Sometimes, it can be useful to discretize a numeric variable, that is, to convert the numeric values into a set of factor levels. This can be used, for example, to fit step functions in an otherwise linear regression. The step_discretize function converts a numeric variable into a factor whose levels are derived from the quantiles of the variable.

Code
transformed <- recipe(y~x, data=xy)  %>%
    step_discretize(x, num_breaks=5) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
| x | y |
|---|---|
| bin1 | -1.0000 |
| bin1 | -0.9798 |
| bin1 | -0.9596 |
| bin1 | -0.9394 |
| bin1 | -0.9192 |
| bin1 | -0.8990 |

By default, step_discretize will create four bins. Here, we specify num_breaks=5 to create five bins. Figure 7.2 shows the effect of applying the step_discretize function.

Code
ggplot(transformed, aes(x=y, y=x)) +
    geom_point()
Factor levels created using `step_discretize`

Figure 7.2: Factor levels created using step_discretize

An alternative to using quantiles is to specify the breaks explicitly using the step_cut function.

Code
breaks = c(-1.1, -0.8, 0.6, 0.7, 0.75)
transformed <- recipe(y~x, data=xy)  %>%
    step_cut(x, breaks=breaks) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
| x | y |
|---|---|
| [-1.1,-0.8] | -1.0000 |
| [-1.1,-0.8] | -0.9798 |
| [-1.1,-0.8] | -0.9596 |
| [-1.1,-0.8] | -0.9394 |
| [-1.1,-0.8] | -0.9192 |
| [-1.1,-0.8] | -0.8990 |

Figure 7.3 shows the effect of applying the step_cut function.

Code
ggplot(transformed, aes(x=y, y=x)) +
    geom_vline(xintercept=breaks, color="grey") +
    geom_point()
Factor levels created using `step_cut`

Figure 7.3: Factor levels created using step_cut

During training, the range of the data will be used to determine the left and right boundaries of the bins. If a new data point falls outside this range, the value will be mapped to NA, which will cause problems when predicting new data. To avoid this, we can use the include_outside_range argument to specify that values outside the range are assigned to the first or last bin.

Code
transformed <- recipe(y~x, data=xy)  %>%
    step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
| x | y |
|---|---|
| [min,-0.8] | -1.0000 |
| [min,-0.8] | -0.9798 |
| [min,-0.8] | -0.9596 |
| [min,-0.8] | -0.9394 |
| [min,-0.8] | -0.9192 |
| [min,-0.8] | -0.8990 |

You can see that the lowest range is now labeled as [min,-0.8].
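
To see the effect on genuinely out-of-range values, we can bake the trained recipe on artificial data points beyond the training range. A minimal sketch:

Code
recipe(y~x, data=xy) %>%
    step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
    prep() %>%
    bake(new_data=tibble(x=c(-5, 5), y=c(0, 0)))   # both x values lie outside the training range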

7.4 Data normalization

Several modeling methods require data to be on the same scale. For example, assume a case where one property has values in the 1000s, while another property has values between 0 and 10. In a \(k\)-nearest neighbor model, the first property will dominate any distance measure while the second property will have little influence. To avoid this, we can normalize the data. The step_normalize function is used to normalize the data.

Code
rec_obj <- recipe(formula, data=data)  %>%
    step_normalize(all_numeric_predictors())
transformed <- rec_obj %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=3)
| cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
|-----|------|----|------|----|------|----|----|------|------|-----|
| -0.105 | -0.571 | -0.535 | 0.568 | -0.610 | -0.777 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
| -0.105 | -0.571 | -0.535 | 0.568 | -0.350 | -0.464 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
| -1.225 | -0.990 | -0.783 | 0.474 | -0.917 | 0.426 | 1.116 | 1.190 | 0.424 | -1.122 | 22.8 |
| -0.105 | 0.220 | -0.535 | -0.966 | -0.002 | 0.890 | 1.116 | -0.814 | -0.932 | -1.122 | 21.4 |
| 1.015 | 1.043 | 0.413 | -0.835 | 0.228 | -0.464 | -0.868 | -0.814 | -0.932 | -0.503 | 18.7 |
| -0.105 | -0.046 | -0.608 | -1.565 | 0.248 | 1.327 | 1.116 | -0.814 | -0.932 | -1.122 | 18.1 |

Normalization will shift and scale each numerical column, so that its mean is 0 and the standard deviation is 1.
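
We can confirm this for a few of the columns; after baking, each mean is (numerically) 0 and each standard deviation is 1:

Code
transformed %>%
    summarise(across(c(disp, hp, wt), list(mean=mean, sd=sd))) %>%
    knitr::kable(digits=3)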

An alternative to normalization is step_range. In this case, the data will be transformed to fall into a given range.

Code
rec_obj <- recipe(formula, data=data)  %>%
    step_range(all_numeric_predictors())
transformed <- rec_obj %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=3)
| cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
|-----|------|----|------|----|------|----|----|------|------|-----|
| 0.5 | 0.222 | 0.205 | 0.525 | 0.283 | 0.233 | 0 | 1 | 0.5 | 0.429 | 21.0 |
| 0.5 | 0.222 | 0.205 | 0.525 | 0.348 | 0.300 | 0 | 1 | 0.5 | 0.429 | 21.0 |
| 0.0 | 0.092 | 0.145 | 0.502 | 0.206 | 0.489 | 1 | 1 | 0.5 | 0.000 | 22.8 |
| 0.5 | 0.466 | 0.205 | 0.147 | 0.435 | 0.588 | 1 | 0 | 0.0 | 0.000 | 21.4 |
| 1.0 | 0.721 | 0.435 | 0.180 | 0.493 | 0.300 | 0 | 0 | 0.0 | 0.143 | 18.7 |
| 0.5 | 0.384 | 0.187 | 0.000 | 0.498 | 0.681 | 1 | 0 | 0.0 | 0.000 | 18.1 |

The default range is [0,1]. You can specify a different range using the min and max arguments.
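
For example, to scale all numeric predictors to the range [-1, 1] (a sketch):

Code
recipe(formula, data=data) %>%
    step_range(all_numeric_predictors(), min=-1, max=1) %>%
    prep() %>%
    bake(new_data=NULL) %>%
    head(3)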

Useful to know:

While methods like \(k\)-nearest neighbors require normalization to work properly, other methods are not affected by the scale of the data. For example, decision trees handle each variable independently. However, it can still be beneficial to bring data to the same scale for numerical efficiency and stability.

7.5 Imputing missing data

If you expect your future data to contain missing values, it is useful to derive a strategy for dealing with missing data, not only for your training data but also for new data. The family of step_impute_* functions provides a variety of imputation strategies that are trained on the training data and applied to new data. To demonstrate this functionality, we will create a new dataset that contains missing values.

Code
set.seed(123)
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate_at(vars(cyl, wt, am), 
              function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )

missing_cyl <- is.na(data['cyl'])
missing_wt <- is.na(data['wt'])
missing_am <- is.na(data['am'])
missing_rows = missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>% 
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
| car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-----|-----|-----|------|----|------|----|------|----|----|------|------|
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | NA | 18.61 | straight | manual | 4 | 1 |
| Valiant | 18.1 | NA | 225.0 | 105 | 2.76 | 3.46 | 20.22 | straight | automatic | 3 | 1 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.30 | straight | NA | 4 | 4 |
| Fiat 128 | 32.4 | NA | 78.7 | 66 | 4.08 | 2.20 | 19.47 | straight | manual | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | NA | 18.52 | straight | manual | 4 | 2 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | NA | 15.50 | V-shaped | manual | 5 | 6 |

The mutate_at function replaces about 10% of the values in the columns cyl, wt, and am with missing values.

For continuous numeric data, the mean and median are the most common imputation strategies (step_impute_mean or step_impute_median). For nominal data, the most common value is used (step_impute_mode).

Code
transformed <- recipe(mpg~., data=data)  %>%
    step_impute_mean(wt) %>%
    step_impute_median(cyl) %>%
    step_impute_mode(am) %>%
    prep() %>%
    bake(new_data=NULL)
transformed[missing_rows, ] %>% 
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
| car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
|-----|-----|------|----|------|----|------|----|----|------|------|-----|
| Datsun 710 | 4 | 108.0 | 93 | 3.85 | 3.319 | 18.61 | straight | manual | 4 | 1 | 22.8 |
| Valiant | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | straight | automatic | 3 | 1 | 18.1 |
| Merc 280 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | straight | automatic | 4 | 4 | 19.2 |
| Fiat 128 | 6 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | straight | manual | 4 | 1 | 32.4 |
| Honda Civic | 4 | 75.7 | 52 | 4.93 | 3.319 | 18.52 | straight | manual | 4 | 2 | 30.4 |
| Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 3.319 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |

In our example, this added the mean value 3.3188621 to the missing values in the wt column, the median value 6 to the missing values in the cyl column, and the mode “automatic” to the missing values in the am column.

In some cases, a better approach is to use a model to impute the missing values.

  • step_impute_linear: Impute numeric variables via a linear model
  • step_impute_bag: Impute via bagged trees
  • step_impute_knn: Impute via k-nearest neighbors

We know from exploratory data analysis that the wt column is correlated with the disp and hp columns. We can use this information to impute the missing values in the wt column.

Code
transformed <- recipe(mpg~., data=data)  %>%
    step_impute_linear(wt, impute_with=imp_vars(disp, hp)) %>%
    prep() %>%
    bake(new_data=NULL)
transformed[missing_wt, ] %>%
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
| car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
|-----|-----|------|----|------|----|------|----|----|------|------|-----|
| Datsun 710 | 4 | 108.0 | 93 | 3.85 | 2.390 | 18.61 | straight | manual | 4 | 1 | 22.8 |
| Honda Civic | 4 | 75.7 | 52 | 4.93 | 2.231 | 18.52 | straight | manual | 4 | 2 | 30.4 |
| Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 2.492 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |

Here we used the step_impute_linear function to impute the missing values in the wt column. The impute_with argument is used to specify the variables that are used to impute the missing values. The imp_vars function is required here to specify the variables. We can see that in this case, the imputed values differ from car to car, since they are predicted from disp and hp.
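
A k-nearest-neighbors imputation works the same way; as a sketch, we again use disp and hp to find similar cars (neighbors=5 is the default, shown here explicitly):

Code
transformed <- recipe(mpg~., data=data)  %>%
    step_impute_knn(wt, impute_with=imp_vars(disp, hp), neighbors=5) %>%
    prep() %>%
    bake(new_data=NULL)
transformed[missing_wt, ] %>%
    select(car, disp, hp, wt)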

7.6 Dummy variables

Most methods cannot handle categorical variables without preprocessing. A common approach is to convert a categorical variable with \(C\) levels into several new columns that contain only 1 and 0 values. If reference cell parametrization is used, \(C-1\) new columns are created for all but the first factor level. In recipes, we use the function step_dummy to create these new dummy variables. Here is an example:

Code
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>% 
    sample_frac()
recipe(~ species, data=penguins) %>%
    step_dummy(species, keep_original_cols=TRUE) %>%
    prep() %>%
    bake(new_data=NULL) %>% 
    head()
## # A tibble: 6 × 3
##   species   species_Chinstrap species_Gentoo
##   <fct>                 <dbl>          <dbl>
## 1 Gentoo                    0              1
## 2 Adelie                    0              0
## 3 Adelie                    0              0
## 4 Chinstrap                 1              0
## 5 Adelie                    0              0
## 6 Gentoo                    0              1

We set the argument keep_original_cols to TRUE to include the original variable; by default, it would be removed. We can see that the step creates two new variables, species_Chinstrap and species_Gentoo. They have values of 0 or 1, depending on the value of species. If the species is Gentoo, species_Gentoo is set to 1 and the other variable to 0. For Chinstrap it is the other way around. For Adelie, both values are set to 0; Adelie is the reference level. This type of encoding is used for models that cannot deal with linearly dependent columns.

An alternative is one hot encoding. In this case, we create new columns for each factor level.

Code
recipe(~ species, data=penguins) %>%
    step_dummy(species, one_hot=TRUE, keep_original_cols=TRUE) %>%
    prep() %>%
    bake(new_data=NULL) %>% 
    head()
## # A tibble: 6 × 4
##   species   species_Adelie species_Chinstrap species_Gentoo
##   <fct>              <dbl>             <dbl>          <dbl>
## 1 Gentoo                 0                 0              1
## 2 Adelie                 1                 0              0
## 3 Adelie                 1                 0              0
## 4 Chinstrap              0                 1              0
## 5 Adelie                 1                 0              0
## 6 Gentoo                 0                 0              1

Setting one_hot=TRUE creates the additional column species_Adelie, which is set to 1 for species Adelie. One hot encoding is usually used for \(k\)-NN and neural networks to treat each factor level equivalently. As can be seen in Figure 7.4, with one hot encoding the Euclidean distances between the different factor levels are identical. With reference cell encoding, the reference level has the same distance to all other levels, and this distance is shorter than the distances between the other levels.

Code
knitr::include_graphics("images/preprocess_dummy.png")

Figure 7.4: Effect of approach to generate dummy variables on distances. Left: reference cell encoding, right: one hot encoding.

To convert all nominal or categorical predictors into dummy variables use:

step_dummy(all_nominal_predictors())

See Handling categorical predictors for more details on handling categorical variables in tidymodels using recipes.

7.7 Interactions

The recipes package also has a way of defining interaction terms. While this could be done using a formula, the step_interact function is particularly useful for defining interactions with variables that were created in a previous step.

Here is an example where we first convert the vs predictor into dummy variables using one-hot encoding. This replaces the vs predictor with vs_V.shaped and vs_straight. In the next step, we want to create interaction terms of these two predictors with hp. We define this using the formula ~ (vs_V.shaped + vs_straight):hp. If the factor has several levels, it is more concise to select the predictors using `starts_with("vs")`, as done in the code below.

Code
transformed <- recipe(mpg~vs+hp, data=data)  %>%
    step_dummy(vs, one_hot=TRUE) %>%
    step_interact(~ starts_with("vs"):hp) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head(3)
## # A tibble: 3 × 6
##      hp   mpg vs_V.shaped vs_straight vs_V.shaped_x_hp vs_straight_x_hp
##   <dbl> <dbl>       <dbl>       <dbl>            <dbl>            <dbl>
## 1   110  21             1           0              110                0
## 2   110  21             1           0              110                0
## 3    93  22.8           0           1                0               93

We can see that step_interact creates two new columns, vs_V.shaped_x_hp and vs_straight_x_hp. The names of the interacting terms are joined with _x_; this separator can be changed via the sep argument.

The following example adds interaction terms between two factors that were one-hot-encoded.

Code
transformed <- recipe(mpg~vs+am, data=data)  %>%
    step_dummy(vs, am, one_hot=TRUE) %>%
    step_interact(~ starts_with("vs"):starts_with("am")) %>%
    prep() %>%
    bake(new_data=NULL)
## Warning: ! There are new levels in a factor: `NA`.
Code
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
| mpg | vs_V.shaped | vs_straight | am_automatic | am_manual | vs_V.shaped_x_am_automatic | vs_V.shaped_x_am_manual | vs_straight_x_am_automatic | vs_straight_x_am_manual |
|-----|-------------|-------------|--------------|-----------|----------------------------|-------------------------|----------------------------|-------------------------|
| 21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 22.8 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 21.4 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 18.7 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 18.1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |

This adds four new columns.

You can also create all possible interactions by using mpg ~ .:., mpg ~ .**2 or mpg ~ .^2 where the . represents all remaining columns after removing mpg.

Code
transformed <- recipe(mpg~hp + wt + vs, data=data)  %>%
    step_interact(mpg ~ .^2) %>%
    # step_interact(mpg ~ .:.) %>%
    # step_interact(~ all_numeric_predictors()**2) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
| hp | wt | vs | mpg | hp_x_wt | hp_x_vsstraight | wt_x_vsstraight |
|----|----|----|-----|---------|-----------------|-----------------|
| 110 | 2.620 | V-shaped | 21.0 | 288.20 | 0 | 0.000 |
| 110 | 2.875 | V-shaped | 21.0 | 316.25 | 0 | 0.000 |
| 93 | NA | straight | 22.8 | NA | 93 | NA |
| 110 | 3.215 | straight | 21.4 | 353.65 | 110 | 3.215 |
| 175 | 3.440 | V-shaped | 18.7 | 602.00 | 0 | 0.000 |
| 105 | 3.460 | straight | 18.1 | 363.30 | 105 | 3.460 |

Using the formula ~ all_numeric_predictors()**2 will only create interaction terms between numerical predictors.

Code
transformed <- recipe(mpg~hp + disp + vs, data=data)  %>%
    # step_interact(mpg ~ .^2) %>%  <= includes vs
    # step_interact(mpg ~ .:.) %>%  <= includes vs
    step_interact(~ all_numeric_predictors()**2) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% kableExtra::kbl(caption="Table created by adding all interaction between numerical predictors") %>% scroll_box(width = "100%")
Table 7.1: Table created by adding all interaction between numerical predictors
| hp | disp | vs | mpg | hp_x_disp |
|----|------|----|-----|-----------|
| 110 | 160 | V-shaped | 21.0 | 17600 |
| 110 | 160 | V-shaped | 21.0 | 17600 |
| 93 | 108 | straight | 22.8 | 10044 |
| 110 | 258 | straight | 21.4 | 28380 |
| 175 | 360 | V-shaped | 18.7 | 63000 |
| 105 | 225 | straight | 18.1 | 23625 |

Note that this will not add quadratic terms.

Useful to know:

Table 7.1 shows the results of adding an interaction between hp and disp. The original predictors have ranges of hp = [52, 335] and disp = [71.1, 472], which are of similar magnitude. The interaction term, however, has a much wider and larger range: hp_x_disp = [3936.4, 101200.0].

If you use a model that is based on distances, like \(k\)-NN, it is important to normalize the data (see 7.4). Otherwise, the interaction term will dominate the distances and reduce the influence of the main terms on the model.
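
Within a recipe, this means placing step_normalize after step_interact, so that the newly created interaction column is rescaled as well. A sketch:

Code
transformed <- recipe(mpg~hp + disp + vs, data=data)  %>%
    step_interact(~ all_numeric_predictors()**2) %>%
    step_normalize(all_numeric_predictors()) %>%   # also rescales hp_x_disp
    prep() %>%
    bake(new_data=NULL)
transformed %>% head(3)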

7.8 Principal components

The recipes package has several functions that combine multiple columns. Here, we will only discuss the step_pca function, which creates principal components. For more information on PCA, see section 18.1. It is recommended to use the step_normalize function prior to step_pca.

Code
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )

transformed <- recipe(mpg~., data=data)  %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp=3) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3) 
| car | vs | am | mpg | PC1 | PC2 | PC3 |
|-----|----|----|-----|-----|-----|-----|
| Mazda RX4 | V-shaped | manual | 21.0 | -0.647 | 1.183 | 0.270 |
| Mazda RX4 Wag | V-shaped | manual | 21.0 | -0.622 | 0.986 | -0.061 |
| Datsun 710 | straight | manual | 22.8 | -2.308 | -0.293 | 0.352 |
| Hornet 4 Drive | straight | automatic | 21.4 | -0.155 | -1.981 | 0.281 |
| Hornet Sportabout | V-shaped | automatic | 18.7 | 1.628 | -0.857 | 0.933 |
| Valiant | straight | automatic | 18.1 | -0.107 | -2.437 | -0.058 |

The predictors are now reduced to three numerical columns called PC1, PC2, and PC3. The outcome mpg and the two categorical predictors vs and am are left unchanged.
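
How much variance the components capture can be retrieved from the trained step using tidy(); number=2 refers to the step_pca step in the recipe above:

Code
pca_prep <- recipe(mpg~., data=data)  %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp=3) %>%
    prep()
tidy(pca_prep, number=2, type="variance") %>%
    head()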

7.9 Filtering variables

So far, we covered preprocessing steps that transform columns into one or more new columns. The recipes package also contains methods to remove columns from the dataset. The most basic ones are step_rm, which removes columns by name, and step_select, which keeps only the named columns.

Other filters take the information in the column into account. step_filter_missing removes columns where the proportion of missing values exceeds a given threshold. This is useful for columns where imputation is not feasible.

The step_zv and step_nzv functions remove columns that are constant (zero variance) or almost constant (near-zero variance). Such columns generally contain little information and can be removed without limiting the performance of models.

Another source of redundant information is columns that are highly correlated with other columns, or columns that are linear combinations of other columns. In some cases, leaving these columns in the dataset can cause numerical problems. The step_corr function removes columns that are highly correlated with other columns. The step_lincomb function removes columns that are linear combinations of other columns.
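
As a sketch, step_corr with its default threshold of 0.9 drops one column from each pair of predictors whose absolute correlation exceeds the threshold:

Code
transformed <- recipe(mpg~., data=data)  %>%
    step_corr(all_numeric_predictors(), threshold=0.9) %>%
    prep() %>%
    bake(new_data=NULL)
names(transformed)   # one of the highly correlated columns has been removed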

Further information:

The tidymodels package parsnip is the package responsible for defining and fitting models. You can find detailed information about each of the model types, the specific engines, and their options in the documentation.

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_recipe.png")
library(tidyverse)
library(tidymodels)
library(patchwork)
library(kableExtra)
data <- datasets::mtcars %>% as_tibble(rownames="car")
data %>% 
    head() %>% 
    knitr::kable()
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)
summary(rec_obj)
rec_obj <- rec_obj %>%
    step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
    step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual"))
rec_obj %>%
    prep() %>%
    bake(new_data = NULL) %>%
    top_n(4)
rec_obj <- rec_obj %>%
    step_normalize(all_numeric_predictors())
rec_obj %>%
    prep() %>%
    bake(new_data = NULL) %>%
    top_n(4)
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)  %>%
    step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
    step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
    step_normalize(all_numeric_predictors())
xy <- tibble(
    x=seq(-1, 1, length.out=100), 
    y=seq(-1, 1, length.out=100)
)

transformed <- recipe(y~x, data=xy)  %>%
    step_poly(x, degree=3) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x_poly_1)) +
    geom_line() +
    geom_line(aes(y=x_poly_2), color="red") +
    geom_line(aes(y=x_poly_3), color="darkgreen") +
    labs(x="x", y="x_poly_i")
transformed <- recipe(y~x, data=xy)  %>%
    step_discretize(x, num_breaks=5) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
    geom_point()
breaks = c(-1.1, -0.8, 0.6, 0.7, 0.75)
transformed <- recipe(y~x, data=xy)  %>%
    step_cut(x, breaks=breaks) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
    geom_vline(xintercept=breaks, color="grey") +
    geom_point()
transformed <- recipe(y~x, data=xy)  %>%
    step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=4) %>% 
    kableExtra::kable_styling(full_width=FALSE)
rec_obj <- recipe(formula, data=data)  %>%
    step_normalize(all_numeric_predictors())
transformed <- rec_obj %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=3)
rec_obj <- recipe(formula, data=data)  %>%
    step_range(all_numeric_predictors())
transformed <- rec_obj %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% 
    knitr::kable(digits=3)
set.seed(123)
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate_at(vars(cyl, wt, am), 
              function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )

missing_cyl <- is.na(data['cyl'])
missing_wt <- is.na(data['wt'])
missing_am <- is.na(data['am'])
missing_rows = missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>% 
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
transformed <- recipe(mpg~., data=data)  %>%
    step_impute_mean(wt) %>%
    step_impute_median(cyl) %>%
    step_impute_mode(am) %>%
    prep() %>%
    bake(new_data=NULL)
transformed[missing_rows, ] %>% 
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
transformed <- recipe(mpg~., data=data)  %>%
    step_impute_linear(wt, impute_with=imp_vars(disp, hp)) %>%
    prep() %>%
    bake(new_data=NULL)
transformed[missing_wt, ] %>%
    knitr::kable(digits=3) %>% 
    scroll_box(width = "100%")
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>% 
    sample_frac()
recipe(~ species, data=penguins) %>%
    step_dummy(species, keep_original_cols=TRUE) %>%
    prep() %>%
    bake(new_data=NULL) %>% 
    head()
recipe(~ species, data=penguins) %>%
    step_dummy(species, one_hot=TRUE, keep_original_cols=TRUE) %>%
    prep() %>%
    bake(new_data=NULL) %>% 
    head()
knitr::include_graphics("images/preprocess_dummy.png")
transformed <- recipe(mpg~vs+hp, data=data)  %>%
    step_dummy(vs, one_hot=TRUE) %>%
    step_interact(~ starts_with("vs"):hp) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head(3)
transformed <- recipe(mpg~vs+am, data=data)  %>%
    step_dummy(vs, am, one_hot=TRUE) %>%
    step_interact(~ starts_with("vs"):starts_with("am")) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + wt + vs, data=data)  %>%
    step_interact(mpg ~ .^2) %>%
    # step_interact(mpg ~ .:.) %>%
    # step_interact(~ all_numeric_predictors()**2) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + disp + vs, data=data)  %>%
    # step_interact(mpg ~ .^2) %>%  <= includes vs
    # step_interact(mpg ~ .:.) %>%  <= includes vs
    step_interact(~ all_numeric_predictors()**2) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% kableExtra::kbl(caption="Table created by adding all interaction between numerical predictors") %>% scroll_box(width = "100%")
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )

transformed <- recipe(mpg~., data=data)  %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp=3) %>%
    prep() %>%
    bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3)