Chapter 14 Model tuning - the basics

In the previous chapters, we learned how to define data preprocessing steps, select a specific model type, train the model and validate its performance. In this chapter, we will learn how to tune our models to get the best performance out of them. Figure 14.1 shows how this step fits into the overall model workflow.

Figure 14.1: Model tuning using tune

There are many ways a model can be tuned.

  • Feature engineering: Feature engineering is the process of creating new features from existing features. This can be as simple as replacing a feature with its square root, or it can involve combining several existing features into a new one. We will learn more about feature engineering in Section 15.1.

  • Feature selection: Feature selection is the process of selecting the most important features for a given model. We will learn more about feature selection in Section 15.3.

  • Hyperparameter tuning: Hyperparameters are parameters that are not learned during the training process. Instead, they are set before the training process starts. For example, the number of neighbors in a k-nearest neighbor model is a hyperparameter. Hyperparameter tuning is the process of finding the best hyperparameter values for a given model type. We will learn more about hyperparameter tuning in this chapter and in Section 15.4.

  • Threshold selection: Threshold selection is the process of finding the best probability threshold for a given classification model. This is currently not included in the workflow package, but can be done post-training using the probably package (see Section 11.1.2); a short, hedged sketch follows this list.
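
As an illustration of the last point, the probably package can compute classification metrics across a range of candidate thresholds. The sketch below is self-contained and uses invented data and column names (outcome, .pred_yes); it is not part of the tuning workflow developed in this chapter.

Code
library(tidymodels)
library(probably)

set.seed(1)
# Invented predictions: the true class plus a predicted probability for the "yes" class
preds <- tibble(
    outcome   = factor(sample(c("yes", "no"), 200, replace = TRUE), levels = c("yes", "no")),
    .pred_yes = runif(200)
)

# Performance (sensitivity, specificity, ...) for a range of candidate thresholds
threshold_perf(preds, outcome, .pred_yes, thresholds = seq(0.1, 0.9, by = 0.1))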

Load the packages we need for this chapter.

Code
library(tidymodels)

14.1 Specifying tunable parameters

In the tidymodels framework, the tune package is responsible for optimizing models through tuning. To do this, it needs to know what should be tuned. We can specify tunable parameters in the preprocessing steps defined using recipe objects and in the model definition step using parsnip objects.

For example, for principal component regression, we want to select the number of principal components to use in the regression model. The step_pca function has the num_comp argument that specifies the number of principal components to use. If we want to find the optimal value of num_comp, we specify this as a tuning parameter using the tune() function in the recipe.

Code
mtcars_rec <- recipe(mpg ~ ., data = mtcars) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp = tune())

Another example is the number of neighbors in a \(k\)-nearest neighbor model. We can specify this number in the nearest_neighbor() function using the neighbors argument.

Code
model <- nearest_neighbor(mode="regression", neighbors=tune())

By combining the recipe and model into a workflow, we define a k-nearest neighbor model using principal components where we optimize both the number of components and the number of neighbors.

Code
mtcars_workflow <- workflow() %>%
    add_recipe(mtcars_rec) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
name       id         source      component         component_id  object
neighbors  neighbors  model_spec  nearest_neighbor  main          # Nearest Neighbors (integer, range [1, 15])
num_comp   num_comp   recipe      step_pca          pca_M30QG     # Components (integer, range [1, 4])

The extract_parameter_set_dials function tells us that the workflow has two tunable parameters: num_comp and neighbors. The table also shows where each parameter is used: neighbors in the nearest_neighbor function and num_comp in the step_pca step. Most importantly, it identifies how each parameter is tuned: the name column gives the function from the dials package that defines the parameter and its search space; see the dials documentation for more information.

The object column contains the actual definition of the search space of the parameter. Let’s have a look at the default settings.

Code
parameters$object[1]
## [[1]]
## # Nearest Neighbors (quantitative)
## Range: [1, 15]
Code
parameters$object[2]
## [[1]]
## # Components (quantitative)
## Range: [1, 4]

The number of components in the PCA can range between 1 and 4, the number of neighbors between 1 and 15. This may not be suitable for our problem and it often isn’t. We can change the settings for one or more of the tunable parameters using the update function.

Code
parameters <- parameters %>%
    update(
        num_comp = num_comp(c(1, 10)),
        neighbors = neighbors(c(1, 5))
    )

This increases the search range for the number of components to 1 through 10 and reduces the search range for the number of neighbors to 1 through 5.

Useful to know:

If you have multiple tuning parameters of the same type in your workflow, you can distinguish them by passing an identifier to the tune() function, e.g. tune("pca1") and tune("pca2"). You can then modify the default settings for each of them separately, e.g. 

parameter_set <- parameter_set %>%
  update(
    pca1 = num_comp(c(1, 10)),
    pca2 = num_comp(c(1, 5))
  )
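
For example, a recipe with two spline steps of the same type could label and adjust them as follows. This is a minimal sketch; the identifiers disp_df and hp_df and the use of step_ns are made up for illustration and do not appear in the example above.

Code
spline_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
    step_ns(disp, deg_free = tune("disp_df")) %>%
    step_ns(hp, deg_free = tune("hp_df"))

extract_parameter_set_dials(spline_rec) %>%
    update(
        disp_df = deg_free(c(2, 8)),
        hp_df = deg_free(c(2, 4))
    )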

14.2 Data-specific tuning parameters

Some tuning parameters have a data-specific component. Here is an example where we tune the mtry parameter, the number of variables to consider at each split, in a random forest model.

Code
wf <- workflow() %>%
    add_recipe(mtcars_rec) %>%
    add_model(rand_forest(mtry=tune(), mode="regression"))
p <- extract_parameter_set_dials(wf)
p
## Collection of 2 parameters for tuning
## 
##  identifier     type    object
##        mtry     mtry nparam[?]
##    num_comp num_comp nparam[+]
## 
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
## 
## See `?dials::finalize` or `?dials::update.parameters` for more information.

The output tells us that the mtry parameter requires more specification, either using dials::finalize or the update function. We can find out more about this by inspecting p$object[[1]].

Code
p$object[[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]

The range for the mtry parameter is defined as [1, ?] where ? indicates that the upper bound is unspecified. If we use the parameters as they are, tuning will throw an error.

grid_regular(p)

Error in `grid_regular()`:
! These arguments contain unknowns: `mtry`.
ℹ See the `finalize()` function.

We need to specify the upper bound for the mtry parameter. We can do this using the dials::finalize function. The finalize function uses the dataset to determine the upper bound for the parameter.

Code
p <- extract_parameter_set_dials(wf) %>%
    finalize(mtcars %>% select(-mpg))
p$object[[1]]
## # Randomly Selected Predictors (quantitative)
## Range: [1, 10]

The range for the mtry parameter is now defined as [1, 10], because the dataset has 10 predictors. Alternatively, you can specify the range using the update function as described in the previous section.
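
For completeness, a minimal sketch of that alternative: setting the upper bound of mtry by hand instead of deriving it from the data.

Code
p <- extract_parameter_set_dials(wf) %>%
    update(mtry = mtry(c(1, 10)))
p$object[[1]]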

14.3 Tuning a workflow

Now that we have specified the tuning parameters, we can use them to search our parameter space to identify the best model.

Code
set.seed(123)
tune_results <- tune_grid(mtcars_workflow, 
                          resamples=vfold_cv(mtcars),
                          grid=grid_regular(parameters))

The tune_grid function requires the workflow, information about the validation strategy (resamples) and the search space (grid). With the exception of the grid argument, this is very similar to the fit_resamples function that we encountered in Chapter 13. For each combination of tuning parameters defined in the grid argument, the tune_grid function trains a model and evaluates its performance (see Chapter 13).

The grid argument specifies the search space for the tuning parameters. In this case, we use a regular grid search. We will learn more about grid search strategies in Section 14.4.
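
By default, tune_grid computes a standard set of metrics for regression models; in this example these are rmse and rsq. If you want other metrics, you can pass a yardstick metric set via the metrics argument. A minimal sketch (the object name tune_results_mae is arbitrary):

Code
tune_results_mae <- tune_grid(mtcars_workflow,
                              resamples = vfold_cv(mtcars),
                              grid = grid_regular(parameters),
                              metrics = metric_set(rmse, rsq, mae))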

Printing the tune_results object is not very informative. The output will only tell us that it’s a tibble with tuning results for a 10-fold cross-validation. The tibble contains the tuning parameters and the performance metrics for each model and cross-validation fold.

You can visualize the results using the autoplot function:

Code
autoplot(tune_results)

Figure 14.2: Visualization of tuning results for workflow tuning

Figure 14.2 shows the results of the tuning process. The \(x\)-axis shows the number of neighbors in the \(k\)-nearest neighbor model and the \(y\)-axis the values of the two calculated metrics, rmse and rsq. The color of the points corresponds to the number of components in the PCA. Points with the same number of components are connected by a line. The best model is the one with the lowest RMSE. We can see that the best model has one component in the PCA and five neighbors in the \(k\)-nearest neighbor model.

The collect_metrics function can be used to extract the performance metrics from the tuning results.

Code
collect_metrics(tune_results) %>% head(7)
## # A tibble: 7 × 8
##   neighbors num_comp .metric .estimator  mean     n std_err .config             
##       <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1         1        1 rmse    standard   3.23     10  0.339  Preprocessor1_Model1
## 2         1        1 rsq     standard   0.817    10  0.0770 Preprocessor1_Model1
## 3         3        1 rmse    standard   2.55     10  0.233  Preprocessor1_Model2
## 4         3        1 rsq     standard   0.839    10  0.0910 Preprocessor1_Model2
## 5         5        1 rmse    standard   2.47     10  0.106  Preprocessor1_Model3
## 6         5        1 rsq     standard   0.841    10  0.0878 Preprocessor1_Model3
## 7         1        5 rmse    standard   3.32     10  0.571  Preprocessor2_Model1

The function computes the mean and standard error for each metric across the 10 folds.
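
If you need the unaggregated per-fold values instead, collect_metrics can return them as well; a quick sketch:

Code
collect_metrics(tune_results, summarize = FALSE) %>% head()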

The show_best function selects a specific metric and shows the five best models for that metric.

Code
show_best(tune_results, metric="rmse")
## # A tibble: 5 × 8
##   neighbors num_comp .metric .estimator  mean     n std_err .config             
##       <int>    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1         5        1 rmse    standard    2.47    10   0.106 Preprocessor1_Model3
## 2         3        1 rmse    standard    2.55    10   0.233 Preprocessor1_Model2
## 3         3       10 rmse    standard    3.05    10   0.364 Preprocessor3_Model2
## 4         3        5 rmse    standard    3.15    10   0.551 Preprocessor2_Model2
## 5         1       10 rmse    standard    3.16    10   0.549 Preprocessor3_Model1

Finally, the select_best function selects the best model based on a specific metric.

Code
best_parameters <- select_best(tune_results, metric="rmse") %>% select(-.config)
best_parameters
## # A tibble: 1 × 2
##   neighbors num_comp
##       <int>    <int>
## 1         5        1

The best_parameters object contains the best value for each tuning parameter in the workflow. We can use this object to finalize the workflow.

Code
best_workflow <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(mtcars)
best_workflow
## ══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_normalize()
## • step_pca()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5L,     data, 5))
## 
## Type of response variable: continuous
## minimal mean absolute error: 2.104075
## Minimal mean squared error: 5.900349
## Best kernel: optimal
## Best k: 5

The finalize_workflow function replaces the tune() placeholders in the workflow with the best parameter values; the subsequent call to fit then trains the finalized workflow on the full training set. The resulting workflow can be used to make predictions on new data.

Code
df <- tibble(
    actual= mtcars$mpg,
    predicted=predict(best_workflow, new_data=mtcars)$.pred
)

ggplot(df, aes(x=actual, y=predicted)) +
    geom_point() +
    geom_abline()

Figure 14.3: Actual versus predicted values for the best model found during tuning

Figure 14.3 shows the actual versus predicted values for the best model. The model seems to do a good job at predicting the actual values. However, this is the prediction on the training set and we obviously need to assess the model performance on a separate holdout set.
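
A minimal sketch of such a holdout assessment using an initial split and last_fit is shown below. Note that in a complete analysis the tuning itself would also be restricted to the training portion of the split; here we simply reuse the parameters found above.

Code
set.seed(123)
mtcars_split <- initial_split(mtcars, prop = 0.8)

final_fit <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    last_fit(mtcars_split)   # fit on the training part, evaluate on the test part

collect_metrics(final_fit)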

14.4 Grid search strategies

The tune package implements a variety of approaches to search the parameter space.

  • grid_regular: we used this function in the previous section; it searches the parameter space using a systematic grid of parameter values. In design of experiments, this is known as a full factorial design.
  • grid_random: this approach randomly samples combinations of parameter values.
  • grid_latin_hypercube: Latin hypercube sampling is an approach that comes from the field of experimental design; it is a type of space-filling design. It is similar to random sampling, but ensures that the parameter space is sampled more evenly.
  • grid_max_entropy: maximum entropy sampling also aims to cover the search space evenly. It is another example of a space-filling design. A short sketch comparing these functions follows this list.
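
To get a feel for these functions, the sketch below builds each grid type for two standalone parameters and compares the number of candidate combinations.

Code
nrow(grid_regular(penalty(), mixture(), levels = 10))        # 10 x 10 = 100 combinations
nrow(grid_random(penalty(), mixture(), size = 30))           # 30 random combinations
nrow(grid_latin_hypercube(penalty(), mixture(), size = 30))  # 30 space-filling combinations
nrow(grid_max_entropy(penalty(), mixture(), size = 30))      # 30 space-filling combinations
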
Figure 14.4 shows the differences between these methods using an example with two continuous parameters, penalty and mixture, the tuning parameters of a penalized logistic regression model.

Figure 14.4: Comparison of different grid search strategies for two continuous tuning parameters

The regular grid splits each parameter range into a fixed number of points, here 10, and enumerates all possible combinations of the parameters. This results in 10 x 10 = 100 models. The random grid randomly samples 30 combinations of the parameters. We can see that there are areas of the parameter space that are not well covered. The Latin hypercube and maximum entropy sampling approaches also randomly sample 30 combinations of the parameters, but in this case, there seems to be a more uniform coverage of the parameter space.

Let’s compare the different search strategies using a real example. We will again use the mtcars dataset and predict the fuel efficiency of cars, this time with a linear regression model. We will use the glmnet engine, which allows us to tune L1 and L2 regularization via the penalty and mixture parameters of the linear regression model.

Code
set.seed(123)

recipe <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())
model <- linear_reg(mode="regression", penalty=tune(), mixture=tune()) %>%
    set_engine("glmnet")
wf <- workflow() %>%
    add_recipe(recipe) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(wf) %>% 
    update(
        penalty=penalty(c(-3, 0.75))
    )
resamples <- vfold_cv(mtcars)
model_regular <- tune_grid(wf, resamples=resamples, 
                           grid=grid_regular(parameters, levels=10))
nrandom = 30
model_random <- tune_grid(wf, resamples=resamples,
                          grid=grid_random(parameters, size=nrandom))
model_latin_hypercube <- tune_grid(wf, resamples=resamples,
                          grid=grid_latin_hypercube(parameters, size=nrandom))
model_max_entropy <- tune_grid(wf, resamples=resamples,
                          grid=grid_max_entropy(parameters, size=nrandom))
df <- rbind(
    cbind(grid='grid_regular', tested=100, 
          model_regular %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_random', tested=nrandom, 
          model_random %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_latin_hypercube', tested=nrandom, 
          model_latin_hypercube %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_max_entropy', tested=nrandom, 
          model_max_entropy %>% show_best(metric='rmse', n=1))
    ) %>% 
    select(-c(.metric, .estimator, n, .config)) %>%
    mutate(
        grid=factor(grid, levels=c('grid_regular', 'grid_random', 
                                  'grid_latin_hypercube', 'grid_max_entropy'))
    )

The following table shows the best model for each search strategy.

grid                   tested  penalty  mixture  RMSE  std_err
grid_regular              100    0.825   1.0000  2.48    0.447
grid_random                30    1.654   0.0705  2.51    0.411
grid_latin_hypercube       30    1.851   0.1087  2.52    0.421
grid_max_entropy           30    1.201   0.7782  2.53    0.466

Superficially, it seems that the regular grid search resulted in the best model. However, the difference from the other models is very small and well within the observed variation of the estimated cross-validation performance. Figure 14.5 shows the mean and standard error of the RMSE for each search strategy.

Code
ggplot(df, aes(x=grid, y=mean, ymin=mean-std_err, ymax=mean+std_err)) +
    geom_point() +
    geom_pointrange() +
    xlab('Method to define grid') + ylab('Mean RMSE')
Comparison of different search strategies

Figure 14.5: Comparison of different search strategies

It is also important to stress that the regular grid search is much more computationally expensive than the other approaches: it evaluated 100 models, while the other methods evaluated only 30. It also seems as if the best models are obtained for a mixture value of 1, i.e. a pure lasso model. If the best value lies at the boundary of the search space, it is likely that the random methods will not find it exactly. If you think this is the case, remove the parameter from tuning and fix its value in the model definition, as sketched below.
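
For the example above, fixing the mixture at 1 (a pure lasso model) and tuning only the penalty would look like this minimal sketch:

Code
lasso_model <- linear_reg(penalty = tune(), mixture = 1) %>%
    set_engine("glmnet")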

Useful to know:

In the literature, regular grid search is usually taught as the default approach to tuning. However, it is not necessarily the best approach. The random approaches are often a better choice, as they can be more efficient and require fewer model evaluations.

14.5 Bayesian hyperparameter optimization

The tune package also implements Bayesian hyperparameter optimization. Bayesian optimization is a sequential approach to hyperparameter optimization. To initialize the method, random sampling is used to get a rough exploration of the parameter space. Using the validation results, a Bayesian model, more precisely a Gaussian process model, is trained with the parameter combinations as predictors and the validation performance as the outcome. The Gaussian process model then predicts the validation performance for a large number of candidate parameter combinations. The predicted performance, combined with the uncertainty of the prediction, is used to prioritize the next parameter combination, trading off global exploration against local exploitation. The most promising combination is then evaluated, and the process is repeated with the old and new data until a stopping criterion is met (maximum number of iterations or no improvement).

Bayesian hyperparameter optimization is available with the tune_bayes function.

Code
model_bayes <- tune_bayes(wf, resamples=resamples, 
                          param_info=parameters, iter=25)

With these settings, the function creates an initial design of 5 parameter evaluations (the default), followed by 25 search iterations. The param_info argument specifies the search space and the iter argument the number of iterations. The tune_bayes function returns a tuning result that can be used in the same way as the result of tune_grid.
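
The stopping behaviour can be adjusted through the control argument. A minimal sketch using control_bayes to stop the search after 10 iterations without improvement:

Code
ctrl <- control_bayes(no_improve = 10, verbose = TRUE)
model_bayes <- tune_bayes(wf, resamples = resamples,
                          param_info = parameters, iter = 25,
                          control = ctrl)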

Figure 14.6 demonstrates the process of the Bayesian hyperparameter optimization.

Code
regular_metrics <- model_regular %>% collect_metrics() %>% filter(.metric=='rmse')
bayes_metrics <- model_bayes%>% collect_metrics() %>% filter(.metric=='rmse')

ggplot(regular_metrics, aes(x=penalty, y=mixture, z=mean)) +
    geom_point(data=df, color='red', size=3) +
    geom_contour(color='black', alpha=0.75, bins=20) +
    geom_point(data=bayes_metrics %>% head(5), color='#6CABBC') +
    geom_point(data=bayes_metrics %>% tail(-5), color='#226DCE') +
    geom_point(data=model_bayes %>% show_best(n=1, metric='rmse'), pch=21, color='white', fill='#0F386C', size=3) +
    scale_x_log10() 

Figure 14.6: Exploration of parameter space during Bayesian hyperparameter tuning for a penalized linear regression model.

The plot shows the parameter combinations that were evaluated during the optimization process. The contour plot was determined from the validation performance calculated for the regular grid search. The blue points show the parameter combinations that were evaluated during the Bayesian hyperparameter optimization. Light blue points were evaluated in the initial phase, the darker blue points were evaluated in the subsequent iterations. The dark blue point highlights the best model found during these iterations. The red point shows the best results from the above grid searches for comparison.

Figure 14.6 clearly shows that the Bayesian hyperparameter optimization fulfills two objectives: exploration, by evaluating parameter combinations in areas of the parameter space that have not been explored before, and exploitation, by evaluating parameter combinations close to the best combination found so far.

Further information:

There are more packages that extend the capabilities of the tune package.

  • The finetune package implements iterative search methods similar to the Bayesian hyperparameter optimization approach, e.g. simulated annealing for global optimization; a minimal sketch follows.
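
A minimal sketch of the simulated annealing search, assuming the finetune package is installed; the interface mirrors that of tune_bayes.

Code
library(finetune)

model_sa <- tune_sim_anneal(wf, resamples = resamples,
                            param_info = parameters, iter = 25)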

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_tune.png")
library(tidymodels)
mtcars_rec <- recipe(mpg ~ ., data = mtcars) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp = tune())
model <- nearest_neighbor(mode="regression", neighbors=tune())
mtcars_workflow <- workflow() %>%
    add_recipe(mtcars_rec) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
parameters$object[1]
parameters$object[2]
parameters <- parameters %>%
    update(
        num_comp = num_comp(c(1, 10)),
        neighbors = neighbors(c(1, 5))
    )
wf <- workflow() %>%
    add_recipe(mtcars_rec) %>%
    add_model(rand_forest(mtry=tune(), mode="regression"))
p <- extract_parameter_set_dials(wf)
p
p$object[[1]]
p <- extract_parameter_set_dials(wf) %>%
    finalize(mtcars %>% select(-mpg))
p$object[[1]]
set.seed(123)
tune_results <- tune_grid(mtcars_workflow, 
                          resamples=vfold_cv(mtcars),
                          grid=grid_regular(parameters))
autoplot(tune_results)
collect_metrics(tune_results) %>% head(7)
show_best(tune_results, metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse") %>% select(-.config)
best_parameters
best_workflow <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(mtcars)
best_workflow
df <- tibble(
    actual= mtcars$mpg,
    predicted=predict(best_workflow, new_data=mtcars)$.pred
)

ggplot(df, aes(x=actual, y=predicted)) +
    geom_point() +
    geom_abline()
knitr::include_graphics("images/tuning-grids.png")
set.seed(123)

recipe <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors())
model <- linear_reg(mode="regression", penalty=tune(), mixture=tune()) %>%
    set_engine("glmnet")
wf <- workflow() %>%
    add_recipe(recipe) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(wf) %>% 
    update(
        penalty=penalty(c(-3, 0.75))
    )
resamples <- vfold_cv(mtcars)
model_regular <- tune_grid(wf, resamples=resamples, 
                           grid=grid_regular(parameters, levels=10))
nrandom = 30
model_random <- tune_grid(wf, resamples=resamples,
                          grid=grid_random(parameters, size=nrandom))
model_latin_hypercube <- tune_grid(wf, resamples=resamples,
                          grid=grid_latin_hypercube(parameters, size=nrandom))
model_max_entropy <- tune_grid(wf, resamples=resamples,
                          grid=grid_max_entropy(parameters, size=nrandom))
df <- rbind(
    cbind(grid='grid_regular', tested=100, 
          model_regular %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_random', tested=nrandom, 
          model_random %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_latin_hypercube', tested=nrandom, 
          model_latin_hypercube %>% show_best(metric='rmse', n=1)),
    cbind(grid='grid_max_entropy', tested=nrandom, 
          model_max_entropy %>% show_best(metric='rmse', n=1))
    ) %>% 
    select(-c(.metric, .estimator, n, .config)) %>%
    mutate(
        grid=factor(grid, levels=c('grid_regular', 'grid_random', 
                                  'grid_latin_hypercube', 'grid_max_entropy'))
    )
df %>% 
    rename(RMSE='mean') %>%
    mutate_if(is.numeric, format, digits=3, nsmall=0) %>%
    knitr::kable()
ggplot(df, aes(x=grid, y=mean, ymin=mean-std_err, ymax=mean+std_err)) +
    geom_point() +
    geom_pointrange() +
    xlab('Method to define grid') + ylab('Mean RMSE')
model_bayes <- tune_bayes(wf, resamples=resamples, 
                          param_info=parameters, iter=25)
regular_metrics <- model_regular %>% collect_metrics() %>% filter(.metric=='rmse')
bayes_metrics <- model_bayes%>% collect_metrics() %>% filter(.metric=='rmse')

ggplot(regular_metrics, aes(x=penalty, y=mixture, z=mean)) +
    geom_point(data=df, color='red', size=3) +
    geom_contour(color='black', alpha=0.75, bins=20) +
    geom_point(data=bayes_metrics %>% head(5), color='#6CABBC') +
    geom_point(data=bayes_metrics %>% tail(-5), color='#226DCE') +
    geom_point(data=model_bayes %>% show_best(n=1, metric='rmse'), pch=21, color='white', fill='#0F386C', size=3) +
    scale_x_log10()