Chapter 7 Data preprocessing
We learned in the previous chapters how to use `dplyr` to preprocess data. While this is useful for data exploration and cleanup, it is not enough for building models. For example, you may need to normalize predictors prior to the actual training step. The normalization transformation depends on the distribution of the training data, and the exact same transformation needs to be applied to new data. Because of this, it is important to include preprocessing steps in the modeling pipeline.
The preprocessing steps can be used to:
- create new features,
- transform the data to make it more suitable for the model,
- introduce non-linearity into the model,
- reduce the number of features, and
- impute missing data.
The tidymodels framework makes this easy. The preprocessing steps are defined using the `recipes` package and combined with the model using a pipeline that is created using the `workflows` package. In this chapter, we will learn how to use the `recipes` package to preprocess data and build models.
Load required packages:
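```r
library(tidyverse)   # dplyr, ggplot2, the %>% pipe, ...
library(tidymodels)  # recipes, workflows, parsnip, ...
library(patchwork)   # arranging plots
library(kableExtra)  # HTML tables (kbl, scroll_box)
```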
7.1 Preprocessing data with recipes
Let’s use the `mtcars` dataset as an example. The `mtcars` dataset contains 32 observations (rows) and 11 variables (columns); check `?mtcars` for details on the dataset. The goal is to predict the fuel consumption (`mpg`) of a car based on the other variables. Here are the first rows of the dataset:
car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Two of the variables are categorical: transmission type (`am`) and engine shape (`vs`). We can also see that the continuous variables have different scales. For example, the displacement (`disp`) is in the hundreds while the number of cylinders (`cyl`) is in the single digits. Our plan for the preprocessing steps in the modeling pipeline is to:
- Convert the categorical variables to factors
- Normalize the continuous variables
We will use the `recipes` package to define the preprocessing steps. The first step is to create a recipe object using the `recipe()` function. The first argument is a formula that specifies the outcome variable and the predictors. The second argument is the data frame that contains the data. The `recipe()` function returns a recipe object that contains the preprocessing steps. The `summary()` function can be used to display the recipe object:
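```r
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)
summary(rec_obj)
```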
## # A tibble: 11 × 4
## variable type role source
## <chr> <list> <chr> <chr>
## 1 cyl <chr [2]> predictor original
## 2 disp <chr [2]> predictor original
## 3 hp <chr [2]> predictor original
## 4 drat <chr [2]> predictor original
## 5 wt <chr [2]> predictor original
## 6 qsec <chr [2]> predictor original
## 7 vs <chr [2]> predictor original
## 8 am <chr [2]> predictor original
## 9 gear <chr [2]> predictor original
## 10 carb <chr [2]> predictor original
## 11 mpg <chr [2]> outcome original
The output tells us which variables are included in the model and what their respective roles are. The role is just a label that is used to identify the variables. For our use, the automatically assigned roles of `predictor` and `outcome` are fine.
Now that we have a recipe, we can add preprocessing steps. The step functions have the general format `step_{X}`:

```r
rec_obj <- step_{X}(rec_obj, ..., arguments) ## or
rec_obj <- rec_obj %>% step_{X}(..., arguments)
```
The `...` stands for a selection of variables. This could either be a list of variable names or a selector like `all_predictors()`, `all_numeric()`, or similar ones. More about this later. The remaining arguments are keyword arguments and require specifying the name of the argument.
The function `step_num2factor()` converts a numerical column to a factor column. The first argument is the recipe object. The second argument is the name of the variable to be converted. The `levels` argument is used to specify the levels of the factor. The `transform` argument is used to specify a function that is applied to the variable before it is converted to a factor.
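We add two such steps to our recipe:

```r
rec_obj <- rec_obj %>%
  step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
  step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual"))
```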
The `levels` array is used to map the number in the column `vs` to a string. The values of `vs` are 0 and 1, so a simple lookup won’t work. We need to first transform the value before we can use it as an index into the `levels` array. This is done using the `transform` function. A value of 0 is changed to 1 by the `transform` function and then used as an index to look up the string “V-shaped” in the `levels` array. Similarly, a value of 1 is changed to 2 and then through lookup converted to “straight”. If your values already map to the correct indices, you can omit the `transform` argument. Finally, the whole column is changed to a factor. The second step does a similar transformation of the `am` column.
We can look at the result of the recipe so far using the `prep()` and `bake()` functions. The `prep()` function trains the steps using, in this case, the data the recipe was defined with. You can use the `training` argument to specify a different dataset. The `bake()` function applies the recipe to a dataset. The `new_data` argument is used to specify a different dataset. If `new_data=NULL`, the recipe is applied to the data that was used to train the recipe.
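Applying this to our recipe:

```r
rec_obj %>%
  prep() %>%
  bake(new_data = NULL) %>%
  top_n(4)
```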
## Selecting by mpg
## # A tibble: 4 × 11
## cyl disp hp drat wt qsec vs am gear carb mpg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 4 78.7 66 4.08 2.2 19.5 straight manual 4 1 32.4
## 2 4 75.7 52 4.93 1.62 18.5 straight manual 4 2 30.4
## 3 4 71.1 65 4.22 1.84 19.9 straight manual 4 1 33.9
## 4 4 95.1 113 3.77 1.51 16.9 straight manual 5 2 30.4
Applying the recipe to the dataset results in a tibble where the columns `vs` and `am` are now factors.

The next step is to normalize the continuous variables. The `step_normalize()` function is used to normalize the variables.
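```r
rec_obj <- rec_obj %>%
  step_normalize(all_numeric_predictors())

rec_obj %>%
  prep() %>%
  bake(new_data = NULL) %>%
  top_n(4)
```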
## Selecting by mpg
## # A tibble: 4 × 11
## cyl disp hp drat wt qsec vs am gear carb mpg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 -1.22 -1.23 -1.18 0.904 -1.04 0.907 straight manual 0.424 -1.12 32.4
## 2 -1.22 -1.25 -1.38 2.49 -1.64 0.376 straight manual 0.424 -0.503 30.4
## 3 -1.22 -1.29 -1.19 1.17 -1.41 1.15 straight manual 0.424 -1.12 33.9
## 4 -1.22 -1.09 -0.491 0.324 -1.74 -0.531 straight manual 1.78 -0.503 30.4
Here, we use the `all_numeric_predictors()` selector to specify all the numeric variables that are labeled as `predictor`. During the `prep()` step, the mean and standard deviation of the variables are computed and stored with the recipe. The values are used in the `bake()` step to transform the data. As we can see, the continuous variables are now normalized. Had we used `all_numeric()` instead of `all_numeric_predictors()`, the outcome variable `mpg` would have been normalized as well.
To summarize, the recipe for preprocessing the data is:
```r
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data) %>%
  step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
  step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
  step_normalize(all_numeric_predictors())
```
Todo:

Now is a good time to look through the reference of the `recipes` package to get an overview of what is available.
In the following, we highlight some of the more commonly used functions.
7.2 Transformations of individual features
The following steps apply numerical transformations:

- `step_inverse`: \(f(x) = 1/x\)
- `step_invlogit`: \(f(x) = 1/(1+\exp(-x))\)
- `step_log`: \(f(x) = \log(x)\)
- `step_logit`: \(f(x) = \log(x/(1-x))\)
- `step_sqrt`: \(f(x) = \sqrt{x}\)
The `step_mutate()` function can be used like the `dplyr::mutate()` function.
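For example, a minimal sketch that adds a derived predictor (the `hp_per_wt` column is only for illustration, not part of the chapter's recipe):

```r
rec_obj <- recipe(mpg ~ hp + wt, data=data) %>%
  # derive a power-to-weight ratio, analogous to dplyr::mutate()
  step_mutate(hp_per_wt = hp / wt)
```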
The Box-Cox transformation and the Yeo-Johnson transformation can be used to transform skewed data to have a more normal distribution (see Wikipedia):

- `step_BoxCox`: Box-Cox transformation for non-negative data
- `step_YeoJohnson`: Yeo-Johnson transformation
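Both are used like the other single-column steps. A minimal sketch, assuming we want to reduce the skew of all numeric predictors:

```r
rec_obj <- recipe(mpg ~ ., data=data) %>%
  # the transformation parameter lambda is estimated during prep()
  step_YeoJohnson(all_numeric_predictors())
```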
All these steps transform a single column and replace the column with the transformed value. This is different for the `step_poly()` function.
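To demonstrate, we apply it to a small artificial dataset:

```r
xy <- tibble(
  x=seq(-1, 1, length.out=100),
  y=seq(-1, 1, length.out=100)
)
transformed <- recipe(y~x, data=xy) %>%
  step_poly(x, degree=3) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=4)
```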
y | x_poly_1 | x_poly_2 | x_poly_3 |
---|---|---|---|
-1.0000 | -0.1715 | 0.2170 | -0.2492 |
-0.9798 | -0.1680 | 0.2038 | -0.2190 |
-0.9596 | -0.1646 | 0.1910 | -0.1903 |
-0.9394 | -0.1611 | 0.1783 | -0.1632 |
-0.9192 | -0.1576 | 0.1660 | -0.1375 |
-0.8990 | -0.1542 | 0.1539 | -0.1132 |
The result of applying the `step_poly()` function is three new columns: `x_poly_1`, `x_poly_2`, and `x_poly_3`. The columns contain the first three orthogonal polynomials of the `x` column. The `degree` argument specifies the degree of the polynomials, here 3. The `role` argument is used to specify the role of the new columns. The default is `predictor`.
Figure 7.1 shows the effect of applying the `step_poly()` function to the `x` column. The first polynomial `x_poly_1` is linear, `x_poly_2` is a transformation with a quadratic function, and `x_poly_3` is a cubic function. The polynomials are orthogonal, which means that they are uncorrelated. This is useful when using the polynomials as predictors in a regression model.
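The plot in Figure 7.1 is created with:

```r
ggplot(transformed, aes(x=y, y=x_poly_1)) +
  geom_line() +
  geom_line(aes(y=x_poly_2), color="red") +
  geom_line(aes(y=x_poly_3), color="darkgreen") +
  labs(x="x", y="x_poly_i")
```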
Finally, there are several functions to convert a column to a variety of splines:

- `step_ns`: natural spline basis functions
- `step_bs`: B-spline basis functions
- `step_spline_b`: basis splines
- `step_spline_convex`: convex splines
- `step_spline_monotone`: monotone splines
- `step_spline_natural`: natural splines
- `step_spline_nonnegative`: non-negative splines
7.3 Discretizing numeric variables
Sometimes it can be useful to discretize a numeric variable, that is, to convert the numeric values into a set of factors. This can be used for stepwise linear regression. The `step_discretize()` function converts a numeric variable into a set of factors using the quantiles of the variable.
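```r
transformed <- recipe(y~x, data=xy) %>%
  step_discretize(x, num_breaks=5) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=4)
```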
x | y |
---|---|
bin1 | -1.0000 |
bin1 | -0.9798 |
bin1 | -0.9596 |
bin1 | -0.9394 |
bin1 | -0.9192 |
bin1 | -0.8990 |
By default, `step_discretize()` will create four factors. Here, we specify `num_breaks=5` to create five factors. Figure 7.2 shows the effect of applying the `step_discretize()` function.
An alternative to using quantiles is to specify the breaks explicitly using the `step_cut()` function.
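```r
breaks <- c(-1.1, -0.8, 0.6, 0.7, 0.75)
transformed <- recipe(y~x, data=xy) %>%
  step_cut(x, breaks=breaks) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=4)
```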
x | y |
---|---|
[-1.1,-0.8] | -1.0000 |
[-1.1,-0.8] | -0.9798 |
[-1.1,-0.8] | -0.9596 |
[-1.1,-0.8] | -0.9394 |
[-1.1,-0.8] | -0.9192 |
[-1.1,-0.8] | -0.8990 |
Figure 7.3 shows the effect of applying the `step_cut()` function.
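```r
ggplot(transformed, aes(x=y, y=x)) +
  geom_vline(xintercept=breaks, color="grey") +
  geom_point()
```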
During training, the range of the data will be used to determine the left and right boundaries of the bins. If a new data point falls outside this range, the value will be mapped to `NA`. This will cause problems when predicting new data. To avoid this, we can use the `include_outside_range` argument to specify that values outside the range will be assigned to the first or last bin.
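```r
transformed <- recipe(y~x, data=xy) %>%
  step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=4)
```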
x | y |
---|---|
[min,-0.8] | -1.0000 |
[min,-0.8] | -0.9798 |
[min,-0.8] | -0.9596 |
[min,-0.8] | -0.9394 |
[min,-0.8] | -0.9192 |
[min,-0.8] | -0.8990 |
You can see that the lowest range is now labeled as `[min,-0.8]`.
7.4 Data normalization
Several model methods require data to be on the same scale. For example, assume a case where one property has values in the 1000s, while another property has values between 0 and 10. In a \(k\)-nearest neighbor model, the first property will dominate any distance measure while the second property will have little influence. To avoid this, we can normalize the data. The `step_normalize()` function is used to normalize the data.
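```r
rec_obj <- recipe(formula, data=data) %>%
  step_normalize(all_numeric_predictors())
transformed <- rec_obj %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=3)
```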
cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|
-0.105 | -0.571 | -0.535 | 0.568 | -0.610 | -0.777 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
-0.105 | -0.571 | -0.535 | 0.568 | -0.350 | -0.464 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
-1.225 | -0.990 | -0.783 | 0.474 | -0.917 | 0.426 | 1.116 | 1.190 | 0.424 | -1.122 | 22.8 |
-0.105 | 0.220 | -0.535 | -0.966 | -0.002 | 0.890 | 1.116 | -0.814 | -0.932 | -1.122 | 21.4 |
1.015 | 1.043 | 0.413 | -0.835 | 0.228 | -0.464 | -0.868 | -0.814 | -0.932 | -0.503 | 18.7 |
-0.105 | -0.046 | -0.608 | -1.565 | 0.248 | 1.327 | 1.116 | -0.814 | -0.932 | -1.122 | 18.1 |
Normalization will shift and scale each numerical column, so that its mean is 0 and the standard deviation is 1.
An alternative to normalization is `step_range()`. In this case, the data will be transformed to fall into a given range.
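```r
rec_obj <- recipe(formula, data=data) %>%
  step_range(all_numeric_predictors())
transformed <- rec_obj %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  knitr::kable(digits=3)
```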
cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.222 | 0.205 | 0.525 | 0.283 | 0.233 | 0 | 1 | 0.5 | 0.429 | 21.0 |
0.5 | 0.222 | 0.205 | 0.525 | 0.348 | 0.300 | 0 | 1 | 0.5 | 0.429 | 21.0 |
0.0 | 0.092 | 0.145 | 0.502 | 0.206 | 0.489 | 1 | 1 | 0.5 | 0.000 | 22.8 |
0.5 | 0.466 | 0.205 | 0.147 | 0.435 | 0.588 | 1 | 0 | 0.0 | 0.000 | 21.4 |
1.0 | 0.721 | 0.435 | 0.180 | 0.493 | 0.300 | 0 | 0 | 0.0 | 0.143 | 18.7 |
0.5 | 0.384 | 0.187 | 0.000 | 0.498 | 0.681 | 1 | 0 | 0.0 | 0.000 | 18.1 |
The default range is `[0,1]`. You can specify a different range using the `min` and `max` arguments.
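For example, a minimal sketch that scales all numeric predictors to \([-1, 1]\):

```r
rec_obj <- recipe(formula, data=data) %>%
  # map each numeric predictor to the range [-1, 1]
  step_range(all_numeric_predictors(), min=-1, max=1)
```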
Useful to know:

While methods like nearest neighbor require normalization to work properly, other methods are not affected by the scale of the data. For example, decision trees handle each variable independently. However, it can still be beneficial to bring data to the same scale for numerical efficiency and stability.
7.5 Imputing missing data
If you expect your future data to have missing values, it is useful to derive a strategy for dealing with missing data not only for your training data but also for new data. The family of `step_impute_*` functions provides a variety of imputation strategies that are trained on the training data and applied to new data. To demonstrate this functionality, we will create a new dataset that contains missing values.
```r
set.seed(123)
data <- datasets::mtcars %>%
  as_tibble(rownames="car") %>%
  mutate_at(vars(cyl, wt, am),
            function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
  mutate(
    vs = factor(vs, labels=c("V-shaped", "straight")),
    am = factor(am, labels=c("automatic", "manual")),
  )
missing_cyl <- is.na(data['cyl'])
missing_wt <- is.na(data['wt'])
missing_am <- is.na(data['am'])
missing_rows <- missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>%
  knitr::kable(digits=3) %>%
  scroll_box(width = "100%")
```
car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | NA | 18.61 | straight | manual | 4 | 1 |
Valiant | 18.1 | NA | 225.0 | 105 | 2.76 | 3.46 | 20.22 | straight | automatic | 3 | 1 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.30 | straight | NA | 4 | 4 |
Fiat 128 | 32.4 | NA | 78.7 | 66 | 4.08 | 2.20 | 19.47 | straight | manual | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | NA | 18.52 | straight | manual | 4 | 2 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | NA | 15.50 | V-shaped | manual | 5 | 6 |
The `mutate_at()` function adds about 10% missing data to the columns `cyl`, `wt`, and `am`.
For continuous numeric data, the mean and median are the most common imputation strategies (`step_impute_mean()` or `step_impute_median()`). For nominal data, the most common value is used (`step_impute_mode()`).
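```r
transformed <- recipe(mpg~., data=data) %>%
  step_impute_mean(wt) %>%
  step_impute_median(cyl) %>%
  step_impute_mode(am) %>%
  prep() %>%
  bake(new_data=NULL)
transformed[missing_rows, ] %>%
  knitr::kable(digits=3) %>%
  scroll_box(width = "100%")
```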
car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 4 | 108.0 | 93 | 3.85 | 3.319 | 18.61 | straight | manual | 4 | 1 | 22.8 |
Valiant | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | straight | automatic | 3 | 1 | 18.1 |
Merc 280 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | straight | automatic | 4 | 4 | 19.2 |
Fiat 128 | 6 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | straight | manual | 4 | 1 | 32.4 |
Honda Civic | 4 | 75.7 | 52 | 4.93 | 3.319 | 18.52 | straight | manual | 4 | 2 | 30.4 |
Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 3.319 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |
In our example, this added the value 3.319 (the mean) to the missing values in the `wt` column, 6 (the median) to the missing values in the `cyl` column, and `automatic` (the mode) to the missing values in the `am` column.
In some cases, a better approach is to use a model to impute the missing values:

- `step_impute_linear`: impute numeric variables via a linear model
- `step_impute_bag`: impute via bagged trees
- `step_impute_knn`: impute via k-nearest neighbors
We know from exploratory data analysis that the `wt` column is correlated with the `disp` and `hp` columns. We can use this information to impute the missing values in the `wt` column.
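```r
transformed <- recipe(mpg~., data=data) %>%
  step_impute_linear(wt, impute_with=imp_vars(disp, hp)) %>%
  prep() %>%
  bake(new_data=NULL)
transformed[missing_wt, ] %>%
  knitr::kable(digits=3) %>%
  scroll_box(width = "100%")
```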
car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 4 | 108.0 | 93 | 3.85 | 2.390 | 18.61 | straight | manual | 4 | 1 | 22.8 |
Honda Civic | 4 | 75.7 | 52 | 4.93 | 2.231 | 18.52 | straight | manual | 4 | 2 | 30.4 |
Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 2.492 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |
Here we used the `step_impute_linear()` function to impute the missing values in the `wt` column. The `impute_with` argument is used to specify the variables that are used to impute the missing values. The `imp_vars()` function is required here to specify the variables. We can see that, in this case, the imputed values differ from the mean-imputed values above.
7.6 Dummy variables
Most methods cannot handle categorical variables without preprocessing. A common approach is to convert a categorical variable with \(C\) levels into several new columns that contain only 1 and 0 values. If reference cell parametrization is used, \(C-1\) new columns are created for all but the first factor level. In `recipes`, we use the function `step_dummy()` to create these new dummy variables. Here is an example:
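```r
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>%
  sample_frac()
recipe(~ species, data=penguins) %>%
  step_dummy(species, keep_original_cols=TRUE) %>%
  prep() %>%
  bake(new_data=NULL) %>%
  head()
```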
## # A tibble: 6 × 3
## species species_Chinstrap species_Gentoo
## <fct> <dbl> <dbl>
## 1 Gentoo 0 1
## 2 Adelie 0 0
## 3 Adelie 0 0
## 4 Chinstrap 1 0
## 5 Adelie 0 0
## 6 Gentoo 0 1
We set the argument `keep_original_cols` to `TRUE` to include the original variable. By default, it would be removed. We can see that the step creates two new variables, `species_Chinstrap` and `species_Gentoo`. They have values of 0 or 1, depending on the value of `species`. If the species is Gentoo, `species_Gentoo` is set to 1 and the other variable to 0. For Chinstrap, it is the other way round. For Adelie, both values are set to 0; Adelie is the reference level. This type of encoding is used for models that cannot deal with correlated data.
An alternative is one hot encoding. In this case, we create new columns for each factor level.
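```r
recipe(~ species, data=penguins) %>%
  step_dummy(species, one_hot=TRUE, keep_original_cols=TRUE) %>%
  prep() %>%
  bake(new_data=NULL) %>%
  head()
```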
## # A tibble: 6 × 4
## species species_Adelie species_Chinstrap species_Gentoo
## <fct> <dbl> <dbl> <dbl>
## 1 Gentoo 0 0 1
## 2 Adelie 1 0 0
## 3 Adelie 1 0 0
## 4 Chinstrap 0 1 0
## 5 Adelie 1 0 0
## 6 Gentoo 0 0 1
Setting `one_hot=TRUE` creates the additional column `species_Adelie`, which is set to 1 for species Adelie. One hot encoding is usually used for \(k\)-NN and neural networks to treat each factor level equivalently. As can be seen in Figure 7.4, with one hot encoding the Euclidean distances between the different factor levels are identical. With reference cell encoding, the reference level has the same distance to all other levels, and this distance is shorter than the distances between the other levels.
To convert all nominal or categorical predictors into dummy variables use:

```r
step_dummy(all_nominal_predictors())
```

See Handling categorical predictors for more details on handling categorical variables in tidymodels using `recipes`.
7.7 Interactions
The `recipes` package also has a way of defining interaction terms. While this could be done using a formula, the `step_interact()` function is particularly useful to define interactions with variables that were created in a previous step.
Here is an example where we first convert the `vs` predictor into a dummy variable using one-hot encoding. This replaces the `vs` predictor with `vs_V.shaped` and `vs_straight`. In the next step, we want to create interaction terms of these two predictors with `hp`. We define this using the formula `~ (vs_V.shaped + vs_straight):hp`. If the factor has several levels, it is more concise to select the predictors using `starts_with("vs")`.
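```r
transformed <- recipe(mpg~vs+hp, data=data) %>%
  step_dummy(vs, one_hot=TRUE) %>%
  step_interact(~ starts_with("vs"):hp) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head(3)
```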
## # A tibble: 3 × 6
## hp mpg vs_V.shaped vs_straight vs_V.shaped_x_hp vs_straight_x_hp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 110 21 1 0 110 0
## 2 110 21 1 0 110 0
## 3 93 22.8 0 1 0 93
We can see that `step_interact()` creates two new columns, `vs_V.shaped_x_hp` and `vs_straight_x_hp`. The interacting terms are separated using `_x_` (the `sep` argument).
The following example adds interaction terms between two factors that were one-hot-encoded.
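```r
transformed <- recipe(mpg~vs+am, data=data) %>%
  step_dummy(vs, am, one_hot=TRUE) %>%
  step_interact(~ starts_with("vs"):starts_with("am")) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
```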
## Warning: ! There are new levels in a factor: `NA`.
mpg | vs_V.shaped | vs_straight | am_automatic | am_manual | vs_V.shaped_x_am_automatic | vs_V.shaped_x_am_manual | vs_straight_x_am_automatic | vs_straight_x_am_manual |
---|---|---|---|---|---|---|---|---|
21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
22.8 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
21.4 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
18.7 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
18.1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
This adds four new columns.
You can also create all possible interactions by using `mpg ~ .:.`, `mpg ~ .**2`, or `mpg ~ .^2`, where the `.` represents all remaining columns after removing `mpg`.
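```r
transformed <- recipe(mpg~hp + wt + vs, data=data) %>%
  step_interact(mpg ~ .^2) %>%
  # step_interact(mpg ~ .:.) %>%
  # step_interact(~ all_numeric_predictors()**2) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
```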
hp | wt | vs | mpg | hp_x_wt | hp_x_vsstraight | wt_x_vsstraight |
---|---|---|---|---|---|---|
110 | 2.620 | V-shaped | 21.0 | 288.20 | 0 | 0.000 |
110 | 2.875 | V-shaped | 21.0 | 316.25 | 0 | 0.000 |
93 | NA | straight | 22.8 | NA | 93 | NA |
110 | 3.215 | straight | 21.4 | 353.65 | 110 | 3.215 |
175 | 3.440 | V-shaped | 18.7 | 602.00 | 0 | 0.000 |
105 | 3.460 | straight | 18.1 | 363.30 | 105 | 3.460 |
Using the formula `~ all_numeric_predictors()**2` will only create interaction terms between numerical predictors.
```r
transformed <- recipe(mpg~hp + disp + vs, data=data) %>%
  # step_interact(mpg ~ .^2) %>%   # <= includes vs
  # step_interact(mpg ~ .:.) %>%   # <= includes vs
  step_interact(~ all_numeric_predictors()**2) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>%
  kableExtra::kbl(caption="Table created by adding all interactions between numerical predictors") %>%
  scroll_box(width = "100%")
```
hp | disp | vs | mpg | hp_x_disp |
---|---|---|---|---|
110 | 160 | V-shaped | 21.0 | 17600 |
110 | 160 | V-shaped | 21.0 | 17600 |
93 | 108 | straight | 22.8 | 10044 |
110 | 258 | straight | 21.4 | 28380 |
175 | 360 | V-shaped | 18.7 | 63000 |
105 | 225 | straight | 18.1 | 23625 |
Note that this will not add quadratic terms.
Useful to know:

Table 7.1 shows the results of adding an interaction between `hp` and `disp`. The original predictors have ranges of hp = [52, 335] and disp = [71.1, 472]. Both ranges are similar. The interaction term, however, has a much wider and larger range, hp_x_disp = [3936.4, 101200.0].

If you use a model that is based on distances, like \(k\)-NN, it is important to normalize the data (see Section 7.4). Otherwise, the interaction term will dominate the distances and reduce the influence the main terms can have on the model.
7.8 Principal components
The `recipes` package has several functions that combine multiple columns. Here, we will only discuss the `step_pca()` function. It is used to create principal components. For more information on PCA, see Section 18.1. It is recommended to use the `step_normalize()` function prior to using `step_pca()`.
```r
data <- datasets::mtcars %>%
  as_tibble(rownames="car") %>%
  mutate(
    vs = factor(vs, labels=c("V-shaped", "straight")),
    am = factor(am, labels=c("automatic", "manual")),
  )
transformed <- recipe(mpg~., data=data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp=3) %>%
  prep() %>%
  bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3)
```
car | vs | am | mpg | PC1 | PC2 | PC3 |
---|---|---|---|---|---|---|
Mazda RX4 | V-shaped | manual | 21.0 | -0.647 | 1.183 | 0.270 |
Mazda RX4 Wag | V-shaped | manual | 21.0 | -0.622 | 0.986 | -0.061 |
Datsun 710 | straight | manual | 22.8 | -2.308 | -0.293 | 0.352 |
Hornet 4 Drive | straight | automatic | 21.4 | -0.155 | -1.981 | 0.281 |
Hornet Sportabout | V-shaped | automatic | 18.7 | 1.628 | -0.857 | 0.933 |
Valiant | straight | automatic | 18.1 | -0.107 | -2.437 | -0.058 |
The predictors are now reduced to three numerical columns called `PC1`, `PC2`, and `PC3`. The outcome `mpg` and the two categorical predictors `vs` and `am` are left unchanged.
7.9 Filtering variables
So far, we covered preprocessing steps that transform columns into one or more columns. The `recipes` package also contains methods to remove columns from the dataset. The most basic ones are `step_rm()` and `step_select()`, which remove one or more columns from the dataset by name.
Other filters take the information in the column into account. `step_filter_missing()` removes columns where the proportion of missing values exceeds a given threshold. This is useful for columns where imputation is not feasible.
The `step_zv()` and `step_nzv()` functions remove columns that are constant (zero variance) or almost constant (near-zero variance). Such columns generally contain little information and can be removed without limiting the performance of models.
Another source of redundant information is columns that are highly correlated with other columns, or columns that are linear combinations of other columns. In some cases, leaving these columns in the dataset can cause numerical problems. The `step_corr()` function removes columns that are highly correlated with other columns. The `step_lincomb()` function removes columns that are linear combinations of other columns.
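A minimal sketch of how these filter steps could be combined; the correlation cutoff of 0.9 is only an illustrative choice:

```r
rec_obj <- recipe(mpg ~ ., data=data) %>%
  step_zv(all_predictors()) %>%                           # drop constant columns
  step_corr(all_numeric_predictors(), threshold=0.9) %>%  # drop one of each highly correlated pair
  step_lincomb(all_numeric_predictors())                  # drop exact linear combinations
```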
Further information:

The tidymodels package `parsnip` is responsible for defining and fitting models. You can find detailed information about each of the model types, the specific engines, and their options in the documentation.

- https://recipes.tidymodels.org/ is the documentation for the `recipes` package.
- https://recipes.tidymodels.org/reference/index.html lists all the different preprocessing steps that are available in `recipes`.
- https://bookdown.org/max/FES/ is a book by the authors of tidymodels that covers many aspects of feature engineering.
Code

The code of this chapter is summarized here.

```r
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_recipe.png")
library(tidyverse)
library(tidymodels)
library(patchwork)
library(kableExtra)
data <- datasets::mtcars %>% as_tibble(rownames="car")
data %>%
head() %>%
knitr::kable()
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)
summary(rec_obj)
rec_obj <- rec_obj %>%
step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual"))
rec_obj %>%
prep() %>%
bake(new_data = NULL) %>%
top_n(4)
rec_obj <- rec_obj %>%
step_normalize(all_numeric_predictors())
rec_obj %>%
prep() %>%
bake(new_data = NULL) %>%
top_n(4)
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data) %>%
step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
step_normalize(all_numeric_predictors())
xy <- tibble(
x=seq(-1, 1, length.out=100),
y=seq(-1, 1, length.out=100)
)
transformed <- recipe(y~x, data=xy) %>%
step_poly(x, degree=3) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x_poly_1)) +
geom_line() +
geom_line(aes(y=x_poly_2), color="red") +
geom_line(aes(y=x_poly_3), color="darkgreen") +
labs(x="x", y="x_poly_i")
transformed <- recipe(y~x, data=xy) %>%
step_discretize(x, num_breaks=5) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
geom_point()
breaks = c(-1.1, -0.8, 0.6, 0.7, 0.75)
transformed <- recipe(y~x, data=xy) %>%
step_cut(x, breaks=breaks) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
geom_vline(xintercept=breaks, color="grey") +
geom_point()
transformed <- recipe(y~x, data=xy) %>%
step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
rec_obj <- recipe(formula, data=data) %>%
step_normalize(all_numeric_predictors())
transformed <- rec_obj %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=3)
rec_obj <- recipe(formula, data=data) %>%
step_range(all_numeric_predictors())
transformed <- rec_obj %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>%
knitr::kable(digits=3)
set.seed(123)
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate_at(vars(cyl, wt, am),
function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
missing_cyl <- is.na(data['cyl'])
missing_wt <- is.na(data['wt'])
missing_am <- is.na(data['am'])
missing_rows = missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
transformed <- recipe(mpg~., data=data) %>%
step_impute_mean(wt) %>%
step_impute_median(cyl) %>%
step_impute_mode(am) %>%
prep() %>%
bake(new_data=NULL)
transformed[missing_rows, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
transformed <- recipe(mpg~., data=data) %>%
step_impute_linear(wt, impute_with=imp_vars(disp, hp)) %>%
prep() %>%
bake(new_data=NULL)
transformed[missing_wt, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>%
sample_frac()
recipe(~ species, data=penguins) %>%
step_dummy(species, keep_original_cols=TRUE) %>%
prep() %>%
bake(new_data=NULL) %>%
head()
recipe(~ species, data=penguins) %>%
step_dummy(species, one_hot=TRUE, keep_original_cols=TRUE) %>%
prep() %>%
bake(new_data=NULL) %>%
head()
knitr::include_graphics("images/preprocess_dummy.png")
transformed <- recipe(mpg~vs+hp, data=data) %>%
step_dummy(vs, one_hot=TRUE) %>%
step_interact(~ starts_with("vs"):hp) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head(3)
transformed <- recipe(mpg~vs+am, data=data) %>%
step_dummy(vs, am, one_hot=TRUE) %>%
step_interact(~ starts_with("vs"):starts_with("am")) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + wt + vs, data=data) %>%
step_interact(mpg ~ .^2) %>%
# step_interact(mpg ~ .:.) %>%
# step_interact(~ all_numeric_predictors()**2) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + disp + vs, data=data) %>%
# step_interact(mpg ~ .^2) %>% <= includes vs
# step_interact(mpg ~ .:.) %>% <= includes vs
step_interact(~ all_numeric_predictors()**2) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% kableExtra::kbl(caption="Table created by adding all interaction between numerical predictors") %>% scroll_box(width = "100%")
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
transformed <- recipe(mpg~., data=data) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp=3) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3)
```