Chapter 7 Data preprocessing
In the previous chapters, we learned how to use dplyr to preprocess data. While this is useful for data exploration and cleanup, it is not enough for building models. For example, you may need to normalize predictors before the actual training step. The normalization depends on the distribution of the training data, and exactly the same transformation must be applied to new data. Because of this, it is important to include preprocessing steps in the modeling pipeline.
The preprocessing steps can be used to:
- create new features,
- transform the data to make it more suitable for the model,
- introduce non-linearity into the model,
- reduce the number of features, and
- impute missing data.
The tidymodels framework makes this easy. The preprocessing steps are defined using the recipes package and combined with the model in a pipeline that is created using the workflows package. In this chapter, we will learn how to use the recipes package to preprocess data and build models.
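Load required packages: tidyverse, tidymodels, patchwork, and kableExtra (see the code summary at the end of this chapter).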
7.1 Preprocessing data with recipes
Let's use the mtcars dataset as an example. The mtcars dataset contains 32 observations (rows) and 11 variables (columns); check ?mtcars for details on the dataset. The goal is to predict the fuel consumption (mpg) of a car based on the other variables. Here are the first six rows of the dataset:
car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Two of the variables are categorical: transmission type (am) and engine shape (vs). We can also see that the continuous variables have different scales. For example, the displacement (disp) is in the hundreds while the number of cylinders (cyl) is in the single digits. Our plan for the preprocessing steps in the modeling pipeline is to:
- Convert the categorical variables to factors
- Normalize the continuous variables
We will use the recipes package to define the preprocessing steps. The first step is to create a recipe object using the recipe() function. The first argument is a formula that specifies the outcome variable and the predictors. The second argument is the data frame that contains the data. The recipe() function returns a recipe object that will hold the preprocessing steps. The summary() function can be used to display the variables of the recipe object:
## # A tibble: 11 × 4
## variable type role source
## <chr> <list> <chr> <chr>
## 1 cyl <chr [2]> predictor original
## 2 disp <chr [2]> predictor original
## 3 hp <chr [2]> predictor original
## 4 drat <chr [2]> predictor original
## 5 wt <chr [2]> predictor original
## 6 qsec <chr [2]> predictor original
## 7 vs <chr [2]> predictor original
## 8 am <chr [2]> predictor original
## 9 gear <chr [2]> predictor original
## 10 carb <chr [2]> predictor original
## 11 mpg <chr [2]> outcome original
The output tells us which variables are included in the model and what their respective roles are. The role is just a label that is used to identify the variables. For our use, the automatically assigned roles of predictor and outcome are fine.
Now that we have a recipe, we can add preprocessing steps. The step functions have the general format step_{X}:
rec_obj <- step_{X}(rec_obj, ..., arguments) ## or
rec_obj <- rec_obj %>% step_{X}(..., arguments)
The ... stands for a selection of variables. This can be either a list of variable names or a selector such as all_predictors(), all_numeric(), or similar ones. More about this later. The remaining arguments are keyword arguments and must be specified by name.
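As a small illustration of the two calling styles, the following sketch builds the same recipe once with the function form and once with the pipe. It assumes that the tidymodels packages are installed and uses the built-in mtcars data; the choice of step_log, the columns disp and hp, and the base of 10 are only for illustration.

```r
library(tidymodels)

# Function form: the recipe is the first argument, followed by the
# selected variables and then the named arguments.
rec_a <- recipe(mpg ~ ., data = mtcars)
rec_a <- step_log(rec_a, disp, hp, base = 10)

# Pipe form: the same recipe written as a pipeline.
rec_b <- recipe(mpg ~ ., data = mtcars) %>%
  step_log(disp, hp, base = 10)

# Both recipes produce the same transformed columns.
baked_a <- bake(prep(rec_a), new_data = NULL)
baked_b <- bake(prep(rec_b), new_data = NULL)
all.equal(baked_a$disp, baked_b$disp)
all.equal(baked_a$hp, baked_b$hp)
```

Which form to use is a matter of taste; the pipe form tends to read better once several steps are chained.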
The function step_num2factor converts a numerical column to a factor column. The first argument is the recipe object. The second argument is the name of the variable to be converted. The levels argument is used to specify the levels of the factor. The transform argument is used to specify a function that is applied to the variable before it is converted to a factor.
The levels array is used to map the number in the column vs to a string. The values of vs are 0 and 1, but R vectors are indexed starting at 1, so a simple lookup won't work. We first need to transform the value before we can use it as an index into the levels array. This is done using the transform function. A value of 0 is changed to 1 by the transform function and then used as an index to look up the string "V-shaped" in the levels array. Similarly, a value of 1 is changed to 2 and then converted to "straight" through the lookup. If your values already map to the correct indices, you can omit the transform argument. Finally, the whole column is changed to a factor. The second step does a similar transformation of the am column.
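The lookup described above can be mimicked in plain R. This is only a sketch of the idea, not the internal implementation of step_num2factor; the vector vs below is a made-up example and the name level_names is chosen for illustration.

```r
# Sketch of the lookup that transform and levels perform together.
level_names <- c("V-shaped", "straight")
vs <- c(0, 1, 1, 0)            # raw values in the style of mtcars$vs

index <- vs + 1                # transform: shift 0/1 to the indices 1/2
vs_factor <- factor(level_names[index], levels = level_names)
vs_factor
```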
We can look at the result of the recipe so far using the prep() and bake() functions. The prep() function trains the steps using, in this case, the data that was passed to recipe(). You can use the training argument to specify a different dataset. The bake() function applies the trained recipe to a dataset, which is specified using the new_data argument. If new_data is set to NULL, the recipe is applied to the data that was used to train the recipe. The output below shows the four rows with the highest mpg.
## Selecting by mpg
## # A tibble: 4 × 11
## cyl disp hp drat wt qsec vs am gear carb mpg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 4 78.7 66 4.08 2.2 19.5 straight manual 4 1 32.4
## 2 4 75.7 52 4.93 1.62 18.5 straight manual 4 2 30.4
## 3 4 71.1 65 4.22 1.84 19.9 straight manual 4 1 33.9
## 4 4 95.1 113 3.77 1.51 16.9 straight manual 5 2 30.4
Applying the recipe to the dataset results in a tibble where the columns vs and am are now factors.
The next step is to normalize the continuous variables using the step_normalize() function.
## Selecting by mpg
## # A tibble: 4 × 11
## cyl disp hp drat wt qsec vs am gear carb mpg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 -1.22 -1.23 -1.18 0.904 -1.04 0.907 straight manual 0.424 -1.12 32.4
## 2 -1.22 -1.25 -1.38 2.49 -1.64 0.376 straight manual 0.424 -0.503 30.4
## 3 -1.22 -1.29 -1.19 1.17 -1.41 1.15 straight manual 0.424 -1.12 33.9
## 4 -1.22 -1.09 -0.491 0.324 -1.74 -0.531 straight manual 1.78 -0.503 30.4
Here, we use the all_numeric_predictors() selector to specify all the numeric variables that are labeled as predictor. During the prep step, the mean and standard deviation of the variables are computed and stored with the recipe. These values are used in the bake step to transform the data. As we can see, the continuous variables are now normalized. Had we used all_numeric() instead of all_numeric_predictors(), the outcome variable mpg would have been normalized as well.
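To make the split between training and applying more concrete, here is a minimal sketch that trains the normalization on one part of mtcars and applies it to the rest. The split into the first 24 and the last 8 rows is a hypothetical choice for illustration, as are the variable names.

```r
library(tidymodels)

# Hypothetical split: the first 24 cars act as training data, the rest as new data.
train_data <- mtcars[1:24, ]
new_data   <- mtcars[25:32, ]

rec <- recipe(mpg ~ disp + hp, data = train_data) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates mean and standard deviation from the training data only.
trained <- prep(rec, training = train_data)

# bake() reuses the stored statistics for any dataset.
baked_train <- bake(trained, new_data = NULL)
baked_new   <- bake(trained, new_data = new_data)

# The new data is scaled with the training mean and sd, not with its own.
expected <- (new_data$disp - mean(train_data$disp)) / sd(train_data$disp)
all.equal(baked_new$disp, expected)
```

The baked training data has mean 0 and standard deviation 1 in each predictor; the baked new data generally does not, because it is shifted and scaled with the training statistics.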
To summarize, the recipe for preprocessing the data is:
Code
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data) %>%
step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
step_normalize(all_numeric_predictors())
Todo:
Now is a good time to look through the reference of the recipes package to get an overview of what is available.
In the following, we highlight some of the more commonly used functions.
7.2 Transformations of individual features
The following steps apply numerical transformations:
- step_inverse: \(f(x) = 1/x\)
- step_invlogit: \(f(x) = 1/(1+\exp(-x))\)
- step_log: \(f(x) = \log(x)\)
- step_logit: \(f(x) = \log(x/(1-x))\)
- step_sqrt: \(f(x) = \sqrt{x}\)
The step_mutate() function can be used like the dplyr::mutate function.
The Box-Cox transformation and the Yeo-Johnson transformation can be used to transform skewed data to have a more normal distribution (see Wikipedia).
- step_BoxCox: Box-Cox transformation for non-negative data
- step_YeoJohnson: Yeo-Johnson transformation
All these steps transform a single column and replace the column with the transformed value. This is different for the step_poly function, which replaces a column with several new columns.
y | x_poly_1 | x_poly_2 | x_poly_3 |
---|---|---|---|
-1.0000 | -0.1715 | 0.2170 | -0.2492 |
-0.9798 | -0.1680 | 0.2038 | -0.2190 |
-0.9596 | -0.1646 | 0.1910 | -0.1903 |
-0.9394 | -0.1611 | 0.1783 | -0.1632 |
-0.9192 | -0.1576 | 0.1660 | -0.1375 |
-0.8990 | -0.1542 | 0.1539 | -0.1132 |
The result of applying the step_poly function is three new columns x_poly_1, x_poly_2, and x_poly_3. The columns contain the first three orthogonal polynomials of the x column. The degree argument specifies the degree of the polynomials, here 3. The role argument is used to specify the role of the new columns. The default is predictor.
Figure 7.1 shows the effect of applying the step_poly function to the x column. The first polynomial x_poly_1 is linear, x_poly_2 is a quadratic function, and x_poly_3 is a cubic function. The polynomials are orthogonal, which means that they are uncorrelated. This is useful when using the polynomials as predictors in a regression model.
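The claim that the polynomial columns are uncorrelated can be checked directly. The sketch below recreates the x/y example from this section under the assumption that y is simply a copy of x, and computes the correlation matrix of the three new columns.

```r
library(tidymodels)

x <- seq(-1, 1, length.out = 100)
xy <- data.frame(x = x, y = x)

# Keep only the predictor columns created by step_poly.
poly_cols <- recipe(y ~ x, data = xy) %>%
  step_poly(x, degree = 3) %>%
  prep() %>%
  bake(new_data = NULL, all_predictors())

# The off-diagonal correlations are zero up to rounding error.
round(cor(poly_cols), 6)
```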
Finally, there are several functions to convert a column to a variety of splines:
- step_ns: Natural spline basis functions
- step_bs: B-spline basis functions
- step_spline_b: Basis splines
- step_spline_convex: Convex splines
- step_spline_monotone: Monotone splines
- step_spline_natural: Natural splines
- step_spline_nonnegative: Non-negative splines
7.3 Discretizing numeric variables
Sometimes, it can be useful to discretize a numeric variable, that is, to convert the numeric values into the levels of a factor (bins). This can be used, for example, for piecewise constant (step function) regression. The step_discretize function converts a numeric variable into a factor using the quantiles of the variable.
x | y |
---|---|
bin1 | -1.0000 |
bin1 | -0.9798 |
bin1 | -0.9596 |
bin1 | -0.9394 |
bin1 | -0.9192 |
bin1 | -0.8990 |
By default, step_discretize will create four bins. Here, we specify num_breaks=5 to create five bins. Figure 7.2 shows the effect of applying the step_discretize function.
An alternative to using quantiles is to specify the breaks explicitly using the step_cut function.
x | y |
---|---|
[-1.1,-0.8] | -1.0000 |
[-1.1,-0.8] | -0.9798 |
[-1.1,-0.8] | -0.9596 |
[-1.1,-0.8] | -0.9394 |
[-1.1,-0.8] | -0.9192 |
[-1.1,-0.8] | -0.8990 |
Figure 7.3 shows the effect of applying the step_cut function.
During training, the range of the data will be used to determine the left and right boundaries of the bins. If a new data point falls outside this range, the value will be mapped to NA. This will cause problems when predicting new data. To avoid this, we can use the include_outside_range argument to specify that values outside the range will be assigned to the first or last bin.
x | y |
---|---|
[min,-0.8] | -1.0000 |
[min,-0.8] | -0.9798 |
[min,-0.8] | -0.9596 |
[min,-0.8] | -0.9394 |
[min,-0.8] | -0.9192 |
[min,-0.8] | -0.8990 |
You can see that the lowest range is now labeled as [min,-0.8].
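The behavior for new data can be sketched as follows. The breaks and the three new points are made up for illustration; two of the points lie outside the training range of [-1, 1].

```r
library(tidymodels)

x <- seq(-1, 1, length.out = 100)
xy <- data.frame(x = x, y = x)
breaks <- c(-0.5, 0, 0.5)                          # illustrative breaks
new_points <- data.frame(x = c(-2, 0.2, 3), y = 0) # -2 and 3 are out of range

strict <- recipe(y ~ x, data = xy) %>%
  step_cut(x, breaks = breaks) %>%
  prep()

lenient <- recipe(y ~ x, data = xy) %>%
  step_cut(x, breaks = breaks, include_outside_range = TRUE) %>%
  prep()

bake(strict, new_data = new_points)$x    # NA for -2 and 3
bake(lenient, new_data = new_points)$x   # first and last bin instead of NA
```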
7.4 Data normalization
Several modeling methods require data to be on the same scale. For example, assume a case where one property has values in the 1000s, while another property has values between 0 and 10. In a \(k\)-nearest neighbor model, the first property will dominate any distance measure while the second property will have little influence. To avoid this, we can normalize the data using the step_normalize function.
cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|
-0.105 | -0.571 | -0.535 | 0.568 | -0.610 | -0.777 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
-0.105 | -0.571 | -0.535 | 0.568 | -0.350 | -0.464 | -0.868 | 1.190 | 0.424 | 0.735 | 21.0 |
-1.225 | -0.990 | -0.783 | 0.474 | -0.917 | 0.426 | 1.116 | 1.190 | 0.424 | -1.122 | 22.8 |
-0.105 | 0.220 | -0.535 | -0.966 | -0.002 | 0.890 | 1.116 | -0.814 | -0.932 | -1.122 | 21.4 |
1.015 | 1.043 | 0.413 | -0.835 | 0.228 | -0.464 | -0.868 | -0.814 | -0.932 | -0.503 | 18.7 |
-0.105 | -0.046 | -0.608 | -1.565 | 0.248 | 1.327 | 1.116 | -0.814 | -0.932 | -1.122 | 18.1 |
Normalization will shift and scale each numerical column, so that its mean is 0 and the standard deviation is 1.
An alternative to normalization is step_range. In this case, the data will be transformed to fall into a given range.
cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.222 | 0.205 | 0.525 | 0.283 | 0.233 | 0 | 1 | 0.5 | 0.429 | 21.0 |
0.5 | 0.222 | 0.205 | 0.525 | 0.348 | 0.300 | 0 | 1 | 0.5 | 0.429 | 21.0 |
0.0 | 0.092 | 0.145 | 0.502 | 0.206 | 0.489 | 1 | 1 | 0.5 | 0.000 | 22.8 |
0.5 | 0.466 | 0.205 | 0.147 | 0.435 | 0.588 | 1 | 0 | 0.0 | 0.000 | 21.4 |
1.0 | 0.721 | 0.435 | 0.180 | 0.493 | 0.300 | 0 | 0 | 0.0 | 0.143 | 18.7 |
0.5 | 0.384 | 0.187 | 0.000 | 0.498 | 0.681 | 1 | 0 | 0.0 | 0.000 | 18.1 |
The default range is [0,1]. You can specify a different range using the min and max arguments.
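As a short sketch of the min and max arguments, the following example rescales two mtcars predictors to the range [-1, 1]. The choice of predictors and of the target range is only for illustration.

```r
library(tidymodels)

scaled <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_range(all_numeric_predictors(), min = -1, max = 1) %>%
  prep() %>%
  bake(new_data = NULL)

range(scaled$disp)   # -1  1
range(scaled$hp)     # -1  1
```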
Useful to know:
While methods like \(k\)-nearest neighbor require normalization to work properly, other methods are not affected by the scale of the data. For example, decision trees handle each variable independently. However, it can still be beneficial to bring data to the same scale for numerical efficiency and stability.
7.5 Imputing missing data
If you expect your future data to have missing values, it is useful to derive a strategy for dealing with them not only in your training data but also in new data. The family of step_impute_* functions provides a variety of imputation strategies that are trained on the training data and applied to new data. To demonstrate this functionality, we will create a new dataset that contains missing values.
Code
set.seed(123)
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate_at(vars(cyl, wt, am),
function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
missing_cyl <- is.na(data["cyl"])
missing_wt <- is.na(data["wt"])
missing_am <- is.na(data["am"])
missing_rows <- missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | NA | 18.61 | straight | manual | 4 | 1 |
Valiant | 18.1 | NA | 225.0 | 105 | 2.76 | 3.46 | 20.22 | straight | automatic | 3 | 1 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.30 | straight | NA | 4 | 4 |
Fiat 128 | 32.4 | NA | 78.7 | 66 | 4.08 | 2.20 | 19.47 | straight | manual | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | NA | 18.52 | straight | manual | 4 | 2 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | NA | 15.50 | V-shaped | manual | 5 | 6 |
The mutate_at function sets about 10% of the values in the columns cyl, wt, and am to missing.
For continuous numeric data, the mean and median are the most common imputation strategies (step_impute_mean or step_impute_median). For nominal data, the most common value is used (step_impute_mode).
car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 4 | 108.0 | 93 | 3.85 | 3.319 | 18.61 | straight | manual | 4 | 1 | 22.8 |
Valiant | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | straight | automatic | 3 | 1 | 18.1 |
Merc 280 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | straight | automatic | 4 | 4 | 19.2 |
Fiat 128 | 6 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | straight | manual | 4 | 1 | 32.4 |
Honda Civic | 4 | 75.7 | 52 | 4.93 | 3.319 | 18.52 | straight | manual | 4 | 2 | 30.4 |
Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 3.319 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |
In our example, this filled in the value 3.319 for the missing values in the wt column, 6 for the missing values in the cyl column, and automatic for the missing values in the am column.
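The following minimal sketch shows that the imputation value is learned during training and reused for new data. The small data frame is made up for illustration.

```r
library(tidymodels)

# Hypothetical training data with two missing values in x.
train <- data.frame(y = c(1, 2, 3, 4, 5, 6), x = c(1, 2, NA, 4, 5, NA))

trained <- recipe(y ~ x, data = train) %>%
  step_impute_mean(x) %>%
  prep()

# Training data: NA is replaced by mean(c(1, 2, 4, 5)) = 3.
train_x <- bake(trained, new_data = NULL)$x
train_x

# New data: the stored training mean is reused, not the mean of the new data.
new_x <- bake(trained, new_data = data.frame(x = c(10, NA)))$x
new_x
```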
In some cases, a better approach is to use a model to impute the missing values:
- step_impute_linear: Impute numeric variables via a linear model
- step_impute_bag: Impute via bagged trees
- step_impute_knn: Impute via k-nearest neighbors
We know from exploratory data analysis that the wt column is correlated with the disp and hp columns. We can use this information to impute the missing values in the wt column.
car | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | mpg |
---|---|---|---|---|---|---|---|---|---|---|---|
Datsun 710 | 4 | 108.0 | 93 | 3.85 | 2.390 | 18.61 | straight | manual | 4 | 1 | 22.8 |
Honda Civic | 4 | 75.7 | 52 | 4.93 | 2.231 | 18.52 | straight | manual | 4 | 2 | 30.4 |
Ferrari Dino | 6 | 145.0 | 175 | 3.62 | 2.492 | 15.50 | V-shaped | manual | 5 | 6 | 19.7 |
Here we used the step_impute_linear function to impute the missing values in the wt column. The impute_with argument is used to specify the variables that are used to impute the missing values. The imp_vars function is required here to specify the variables. We can see that in this case, the imputed values differ from row to row and from the mean imputation value of 3.319 used above.
7.6 Dummy variables
Most methods cannot handle categorical variables without preprocessing. A common approach is to convert a categorical variable with \(C\) levels into several new columns that contain only 1 and 0 values. If reference cell parametrization is used, \(C-1\) new columns are created for all but the first factor level. In recipes, we use the function step_dummy to create these new dummy variables. Here is an example:
## # A tibble: 6 × 3
## species species_Chinstrap species_Gentoo
## <fct> <dbl> <dbl>
## 1 Gentoo 0 1
## 2 Adelie 0 0
## 3 Adelie 0 0
## 4 Chinstrap 1 0
## 5 Adelie 0 0
## 6 Gentoo 0 1
We set the argument keep_original_cols to TRUE to include the original variable. By default, it would be removed. We can see that the step creates two new variables species_Chinstrap and species_Gentoo. They have values of 0 or 1, depending on the value of species. If the species is Gentoo, species_Gentoo is set to 1 and the other variable to 0. For Chinstrap it is the other way round. For Adelie, both values are set to 0. Adelie is the reference value. This type of encoding is used for models that cannot deal with linearly dependent columns, for example linear regression.
An alternative is one-hot encoding. In this case, we create a new column for each factor level.
## # A tibble: 6 × 4
## species species_Adelie species_Chinstrap species_Gentoo
## <fct> <dbl> <dbl> <dbl>
## 1 Gentoo 0 0 1
## 2 Adelie 1 0 0
## 3 Adelie 1 0 0
## 4 Chinstrap 0 1 0
## 5 Adelie 1 0 0
## 6 Gentoo 0 0 1
Setting one_hot=TRUE creates the additional column species_Adelie which is set to 1 for species Adelie. One-hot encoding is usually used for \(k\)-NN and neural networks to treat each factor level equivalently. As can be seen in Figure 7.4, with one-hot encoding the Euclidean distances between the different factor levels are identical. With reference cell encoding, the reference level has the same distance to all other levels and this distance is shorter than the distances between the other levels.
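The distances can be verified with a few lines of base R. The two matrices below are written out by hand and contain the encodings of the three species under each scheme.

```r
# Rows are the encodings of Adelie, Chinstrap, and Gentoo.
one_hot <- rbind(Adelie = c(1, 0, 0), Chinstrap = c(0, 1, 0), Gentoo = c(0, 0, 1))
reference <- rbind(Adelie = c(0, 0), Chinstrap = c(1, 0), Gentoo = c(0, 1))

# One-hot encoding: all pairwise distances are sqrt(2).
dist(one_hot)

# Reference cell encoding: Adelie has distance 1 to both other species,
# while Chinstrap and Gentoo are sqrt(2) apart.
dist(reference)
```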
To convert all nominal or categorical predictors into dummy variables use:
step_dummy(all_nominal_predictors())
See Handling categorical predictors for more details on handling categorical variables in tidymodels using recipes.
7.7 Interactions
The recipes package also has a way of defining interaction terms. While this could be done using a formula, the step_interact function is particularly useful to define interactions with variables that were created in a previous step.
Here is an example, where we first convert the vs predictor into a dummy variable using one-hot encoding. This replaces the vs predictor with vs_V.shaped and vs_straight. In the next step, we want to create interaction terms of these two predictors with hp. We can define this using the formula ~ (vs_V.shaped + vs_straight):hp. If the factor has several levels, it will be more concise to select the predictors using starts_with("vs").
## # A tibble: 3 × 6
## hp mpg vs_V.shaped vs_straight vs_V.shaped_x_hp vs_straight_x_hp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 110 21 1 0 110 0
## 2 110 21 1 0 110 0
## 3 93 22.8 0 1 0 93
We can see that step_interact creates two new columns, vs_V.shaped_x_hp and vs_straight_x_hp. The interacting terms are separated using _x_ (sep argument).
The following example adds interaction terms between two factors that were one-hot-encoded.
## Warning: ! There are new levels in `am`: NA.
## ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
mpg | vs_V.shaped | vs_straight | am_automatic | am_manual | vs_V.shaped_x_am_automatic | vs_V.shaped_x_am_manual | vs_straight_x_am_automatic | vs_straight_x_am_manual |
---|---|---|---|---|---|---|---|---|
21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
21.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
22.8 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
21.4 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
18.7 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
18.1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
This adds four new columns.
You can also create all possible interactions by using mpg ~ .:., mpg ~ .**2, or mpg ~ .^2, where the . represents all remaining columns after removing mpg.
hp | wt | vs | mpg | hp_x_wt | hp_x_vsstraight | wt_x_vsstraight |
---|---|---|---|---|---|---|
110 | 2.620 | V-shaped | 21.0 | 288.20 | 0 | 0.000 |
110 | 2.875 | V-shaped | 21.0 | 316.25 | 0 | 0.000 |
93 | NA | straight | 22.8 | NA | 93 | NA |
110 | 3.215 | straight | 21.4 | 353.65 | 110 | 3.215 |
175 | 3.440 | V-shaped | 18.7 | 602.00 | 0 | 0.000 |
105 | 3.460 | straight | 18.1 | 363.30 | 105 | 3.460 |
Using the formula ~ all_numeric_predictors()**2 will only create interaction terms between numerical predictors.
Code
transformed <- recipe(mpg~hp + disp + vs, data=data) %>%
# step_interact(mpg ~ .^2) %>% <= includes vs
# step_interact(mpg ~ .:.) %>% <= includes vs
step_interact(~ all_numeric_predictors()**2) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% kableExtra::kbl(caption="Table created by adding all interaction between numerical predictors") %>% scroll_box(width = "100%")
hp | disp | vs | mpg | hp_x_disp |
---|---|---|---|---|
110 | 160 | V-shaped | 21.0 | 17600 |
110 | 160 | V-shaped | 21.0 | 17600 |
93 | 108 | straight | 22.8 | 10044 |
110 | 258 | straight | 21.4 | 28380 |
175 | 360 | V-shaped | 18.7 | 63000 |
105 | 225 | straight | 18.1 | 23625 |
Note that this will not add quadratic terms.
Useful to know:
Table 7.1 shows the results of adding an interaction between hp and disp. The original predictors have ranges of hp=[52, 335] and disp=[71.1, 472]. Both ranges are similar. The interaction term, however, has a much wider and larger range of hp_x_disp=[3936.4, 101200.0].
If you use a model that is based on distances like \(k\)-NN, it is important to normalize the data (see Section 7.4). Otherwise, the interaction term will dominate the distances and reduce the influence the main terms can have on the model.
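The ranges quoted above can be reproduced with the following sketch, which uses the built-in mtcars data without missing values. It also shows one way to bring the interaction term back onto the same scale as the main terms by adding step_normalize after step_interact; the variable names are chosen for illustration.

```r
library(tidymodels)

inter <- recipe(mpg ~ hp + disp, data = mtcars) %>%
  step_interact(~ hp:disp) %>%
  prep() %>%
  bake(new_data = NULL)

range(inter$hp)          # 52 335
range(inter$disp)        # 71.1 472
range(inter$hp_x_disp)   # 3936.4 101200

# Normalizing after creating the interaction puts all three columns on one scale.
normalized <- recipe(mpg ~ hp + disp, data = mtcars) %>%
  step_interact(~ hp:disp) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

sapply(normalized[, c("hp", "disp", "hp_x_disp")], sd)
```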
7.8 Principal components
The recipes package has several functions that combine multiple columns. Here, we will only discuss the step_pca function. It is used to create principal components. For more information on PCA, see section 18.1. It is recommended to use the step_normalize function prior to using step_pca.
Code
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
transformed <- recipe(mpg~., data=data) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp=3) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3)
car | vs | am | mpg | PC1 | PC2 | PC3 |
---|---|---|---|---|---|---|
Mazda RX4 | V-shaped | manual | 21.0 | -0.647 | 1.183 | 0.270 |
Mazda RX4 Wag | V-shaped | manual | 21.0 | -0.622 | 0.986 | -0.061 |
Datsun 710 | straight | manual | 22.8 | -2.308 | -0.293 | 0.352 |
Hornet 4 Drive | straight | automatic | 21.4 | -0.155 | -1.981 | 0.281 |
Hornet Sportabout | V-shaped | automatic | 18.7 | 1.628 | -0.857 | 0.933 |
Valiant | straight | automatic | 18.1 | -0.107 | -2.437 | -0.058 |
The numerical predictors are now reduced to three columns called PC1, PC2, and PC3. The outcome mpg and the two categorical predictors vs and am are left unchanged.
7.9 Filtering variables
So far, we covered preprocessing steps that transform columns into one or more columns. The recipes package also contains methods to remove columns from the dataset. The most basic ones are step_rm, which removes one or more columns by name, and step_select, which keeps only the selected columns.
Other filters take the information in the column into account. step_filter_missing removes columns where the proportion of missing values exceeds a given threshold. This is useful for columns where imputation is not feasible.
The step_zv and step_nzv functions remove columns that are constant or almost constant. Such columns generally contain little information and can be removed without limiting the performance of models.
Another source of redundant information is columns that are highly correlated with other columns or that are linear combinations of other columns. In some cases, leaving these columns in the dataset can cause numerical problems.
The step_corr function removes columns that are highly correlated with other columns. The step_lincomb function removes columns that are linear combinations of other columns.
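As a minimal sketch of such a filter, the following example removes a constant column with step_zv. The data frame toy and its column names are made up for illustration.

```r
library(tidymodels)

# Hypothetical data: the column `constant` carries no information.
toy <- data.frame(y = 1:5, a = c(2, 4, 6, 8, 10), constant = 1)

filtered <- recipe(y ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

names(filtered)   # the zero-variance column is no longer included
```

The other filter steps are added to a recipe in the same way; which columns they remove is decided during prep() on the training data.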
Further information:
The tidymodels package parsnip is responsible for defining and fitting models. You can find detailed information about each of the model types, the specific engines, and their options in the documentation.
- https://recipes.tidymodels.org/ is the documentation for the recipes package.
- https://recipes.tidymodels.org/reference/index.html lists all the different preprocessing steps that are available in recipes.
- https://bookdown.org/max/FES/ This book by the authors of tidymodels covers many aspects of feature engineering.
Code
The code of this chapter is summarized here.
Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_recipe.png")
library(tidyverse)
library(tidymodels)
library(patchwork)
library(kableExtra)
data <- datasets::mtcars %>% as_tibble(rownames="car")
data %>%
head() %>%
knitr::kable()
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data)
summary(rec_obj)
rec_obj <- rec_obj %>%
step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual"))
rec_obj %>%
prep() %>%
bake(new_data = NULL) %>%
top_n(4)
rec_obj <- rec_obj %>%
step_normalize(all_numeric_predictors())
rec_obj %>%
prep() %>%
bake(new_data = NULL) %>%
top_n(4)
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
rec_obj <- recipe(formula, data=data) %>%
step_num2factor(vs, transform=function(x) x + 1, levels=c("V-shaped", "straight")) %>%
step_num2factor(am, transform=function(x) x + 1, levels=c("automatic", "manual")) %>%
step_normalize(all_numeric_predictors())
xy <- tibble(
x=seq(-1, 1, length.out=100),
y=seq(-1, 1, length.out=100)
)
transformed <- recipe(y ~ x, data=xy) %>%
step_poly(x, degree=3) %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x_poly_1)) +
geom_line() +
geom_line(aes(y=x_poly_2), color="red") +
geom_line(aes(y=x_poly_3), color="darkgreen") +
labs(x="x", y="x_poly_i")
transformed <- recipe(y ~ x, data=xy) %>%
step_discretize(x, num_breaks=5) %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
geom_point()
breaks <- c(-1.1, -0.8, 0.6, 0.7, 0.75)
transformed <- recipe(y ~ x, data=xy) %>%
step_cut(x, breaks=breaks) %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
ggplot(transformed, aes(x=y, y=x)) +
geom_vline(xintercept=breaks, color="grey") +
geom_point()
transformed <- recipe(y ~ x, data=xy) %>%
step_cut(x, breaks=breaks, include_outside_range=TRUE) %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=4) %>%
kableExtra::kable_styling(full_width=FALSE)
rec_obj <- recipe(formula, data=data) %>%
step_normalize(all_numeric_predictors())
transformed <- rec_obj %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=3)
rec_obj <- recipe(formula, data=data) %>%
step_range(all_numeric_predictors())
transformed <- rec_obj %>%
prep() %>%
bake(new_data=NULL)
transformed %>%
head() %>%
knitr::kable(digits=3)
set.seed(123)
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate_at(vars(cyl, wt, am),
function(x) ifelse(runif(length(x)) < 0.1, NA, x)) %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
missing_cyl <- is.na(data["cyl"])
missing_wt <- is.na(data["wt"])
missing_am <- is.na(data["am"])
missing_rows <- missing_cyl | missing_wt | missing_am
data[missing_rows, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
transformed <- recipe(mpg ~ ., data=data) %>%
step_impute_mean(wt) %>%
step_impute_median(cyl) %>%
step_impute_mode(am) %>%
prep() %>%
bake(new_data=NULL)
transformed[missing_rows, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
transformed <- recipe(mpg ~ ., data=data) %>%
step_impute_linear(wt, impute_with=imp_vars(disp, hp)) %>%
prep() %>%
bake(new_data=NULL)
transformed[missing_wt, ] %>%
knitr::kable(digits=3) %>%
scroll_box(width = "100%")
penguins <- readr::read_csv("data/penguins_modified.csv.gz") %>%
sample_frac()
recipe(~ species, data=penguins) %>%
step_dummy(species, keep_original_cols=TRUE) %>%
prep() %>%
bake(new_data=NULL) %>%
head()
recipe(~ species, data=penguins) %>%
step_dummy(species, one_hot=TRUE, keep_original_cols=TRUE) %>%
prep() %>%
bake(new_data=NULL) %>%
head()
knitr::include_graphics("images/preprocess_dummy.png")
transformed <- recipe(mpg~vs+hp, data=data) %>%
step_dummy(vs, one_hot=TRUE) %>%
step_interact(~ starts_with("vs"):hp) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head(3)
transformed <- recipe(mpg~vs+am, data=data) %>%
step_dummy(vs, am, one_hot=TRUE) %>%
step_interact(~ starts_with("vs"):starts_with("am")) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + wt + vs, data=data) %>%
step_interact(mpg ~ .^2) %>%
# step_interact(mpg ~ .:.) %>%
# step_interact(~ all_numeric_predictors()**2) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable() %>% scroll_box(width = "100%")
transformed <- recipe(mpg~hp + disp + vs, data=data) %>%
# step_interact(mpg ~ .^2) %>% <= includes vs
# step_interact(mpg ~ .:.) %>% <= includes vs
step_interact(~ all_numeric_predictors()**2) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% kableExtra::kbl(caption="Table created by adding all interaction between numerical predictors") %>% scroll_box(width = "100%")
data <- datasets::mtcars %>%
as_tibble(rownames="car") %>%
mutate(
vs = factor(vs, labels=c("V-shaped", "straight")),
am = factor(am, labels=c("automatic", "manual")),
)
transformed <- recipe(mpg~., data=data) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp=3) %>%
prep() %>%
bake(new_data=NULL)
transformed %>% head() %>% knitr::kable(digits=3)