Chapter 15 Model tuning - examples

In the previous chapters, we learned how to define and tune hyperparameters using the tidymodels packages dials and tune. In this chapter, we will go deeper into specific areas of model tuning. Figure 15.1 shows how this step fits into the overall model workflow.
Model tuning using `tune`

Figure 15.1: Model tuning using tune

Load the packages we need for this chapter.

Code
library(tidymodels)
library(tidyverse)
library(patchwork)

Because tuning requires training many models, we also enable parallel computing.

Code
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

15.1 Feature engineering

We already saw in Chapter 14 that preprocessing steps have tunable parameters. The function tunable() can be used to identify tunable parameters in a recipe. For example, the following code shows how to identify tunable parameters in a recipe that contains a polynomial step.

Code
# tidymodels is very picky about data types and will complain when we
# predict on new data if the age value is not an integer. We therefore
# convert here age to double
data <- ISLR2::Wage %>%
    mutate(age = as.double(age))

recipe(wage ~ age, data=data) %>%
    step_poly(hp) %>%
    step_discretize(hp) %>%
    step_cut(hp, breaks=tune()) %>%
    step_bs(hp) %>%
    tunable()
## # A tibble: 5 × 5
##   name       call_info        source component       component_id    
##   <chr>      <list>           <chr>  <chr>           <chr>           
## 1 degree     <named list [2]> recipe step_poly       poly_ZUDru      
## 2 min_unique <named list [2]> recipe step_discretize discretize_bUFT3
## 3 num_breaks <named list [2]> recipe step_discretize discretize_bUFT3
## 4 deg_free   <named list [3]> recipe step_bs         bs_JyJy0        
## 5 degree     <named list [3]> recipe step_bs         bs_JyJy0

We see that the polynomial step has a tunable parameter degree.

15.1.1 Polynomial regression

With this information, we can tune a polynomial regression model.

Code
set.seed(123)
poly_recipe <- recipe(wage ~ age, data=data) %>%
    step_poly(age, degree=tune())
model <- linear_reg(mode="regression") %>%
    set_engine("glm")
poly_workflow <- workflow() %>%
    add_recipe(poly_recipe) %>%
    add_model(model)
tune_results <- tune_grid(poly_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
## # A tibble: 3 × 7
##   degree .metric .estimator  mean     n std_err .config             
##    <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1      3 rmse    standard    39.7    10    1.30 Preprocessor3_Model1
## 2      2 rmse    standard    39.8    10    1.30 Preprocessor1_Model1
## 3      1 rmse    standard    40.8    10    1.28 Preprocessor2_Model1

By default, this will tune the polynomial degree from 1 to 3. The best model was with a quadratic polynomial. We can use the finalize_workflow() function to fit the best model to the entire data set.

Code
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- poly_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)

Figure 15.2 shows the best fit of the polynomial regression model. The shaded area shows the 95% confidence interval of the fit. The polynomial regression model is a linear model, but the polynomial step allows us to fit a non-linear relationship between the predictor and the outcome.

Code
df <- tibble(age = seq(min(data$age), max(data$age), length.out = 100))
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
Polynomial regression model

Figure 15.2: Polynomial regression model

15.1.2 Step function regression

In this section, we develop a model using a step function. The step_discretize() function is used to convert the age features into bins. The tunable parameter is num_breaks.

Code
set.seed(123)
step_recipe <- recipe(wage ~ age, data=data) %>%
    step_discretize(age, num_breaks=tune())
step_workflow <- workflow() %>%
    add_recipe(step_recipe) %>%
    add_model(model)
tune_results <- tune_grid(step_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
## # A tibble: 5 × 7
##   num_breaks .metric .estimator  mean     n std_err .config             
##        <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1          5 rmse    standard    40.3    10    1.27 Preprocessor6_Model1
## 2          4 rmse    standard    40.3    10    1.25 Preprocessor2_Model1
## 3          7 rmse    standard    40.8    10    1.28 Preprocessor1_Model1
## 4          6 rmse    standard    40.8    10    1.28 Preprocessor3_Model1
## 5          8 rmse    standard    40.8    10    1.28 Preprocessor4_Model1

Tuning determines that the best model uses 5 bins. We can use the finalize_workflow() function to fit the best model to the entire data set.

Code
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- step_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)

Figure 15.3 shows the best fit of the regression model using a step function.

Code
best_breaks <- (tune_results %>% show_best(metric="rmse"))$num_breaks[1]
cuts <- tibble(breaks=quantile(data$age, probs=seq(0, 1, by=1 / best_breaks)))
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_vline(aes(xintercept=breaks), data=cuts, color="darkgreen", alpha=0.5) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
Stepwise regression model; the green lines show the bin boundaries.

Figure 15.3: Stepwise regression model; the green lines show the bin boundaries.

The model developed in this section is different from the step function model described in Sections 7.2 and 7.8.1 of An introduction to statistical learning (James et al. 2021). In the book, the authors use cut(age, 4) to create a categorical variable with four levels. This leads to different boundaries than defined by the quantiles that the step_discretize() function uses. If you want to use defined boundaries, e.g. split into age groups of 10 years, you can use

step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)

The breaks are however not tunable in this case. Figure 15.4 shows the results of this model.

Code
step_recipe <- recipe(wage ~ age, data=data) %>%
    step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)
step_workflow <- workflow() %>%
    add_recipe(step_recipe) %>%
    add_model(model)
trained <- step_workflow %>% fit(data)
cuts <- tibble(breaks=seq(20, 70, by=10))
df %>%
    bind_cols(
        predict(trained, new_data=df),
        predict(trained, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_vline(aes(xintercept=breaks), data=cuts, color="darkgreen", alpha=0.5) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
Stepwise regression model with fixed bin boundaries.

Figure 15.4: Stepwise regression model with fixed bin boundaries.

15.1.3 Spline regression

Next we will tune a model using a spline representation for age.

Code
set.seed(123)
spline_recipe <- recipe(wage ~ age, data=data) %>%
    step_bs(age, degree=tune(), deg_free=tune())
spline_workflow <- workflow() %>%
    add_recipe(spline_recipe) %>%
    add_model(model)
tune_results <- tune_grid(spline_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
## # A tibble: 5 × 8
##   deg_free degree .metric .estimator  mean     n std_err .config             
##      <int>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1        6      2 rmse    standard    39.7    10    1.30 Preprocessor6_Model1
## 2        8      1 rmse    standard    39.7    10    1.28 Preprocessor8_Model1
## 3        4      2 rmse    standard    39.7    10    1.30 Preprocessor1_Model1
## 4       14      1 rmse    standard    39.7    10    1.27 Preprocessor7_Model1
## 5       10      2 rmse    standard    39.8    10    1.29 Preprocessor4_Model1
Code
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- spline_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)

Figure 15.5 shows the best fit of the spline regression model.

Code
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
    ggplot(aes(x=age, y=.pred)) +
      geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
      geom_line() +
      geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
      labs(x="Age", y="Wage")
Spline regression model

Figure 15.5: Spline regression model

Other examples of preprocessing steps that can be tuned for feature engineering are:

  • step_pca using the num_comp parameter
  • step_nzv to remove variables that are highly sparse and unbalanced using the freq_cut or unique_cut parameters

15.2 Regularization

Regularization is used in several machine learning methods. In general, it is any approach that controls the complexity of a model. For linear models, there are two common types of regularization. Both control the complexity of the model by penalizing the size of the coefficients.

  • L2-regularization (ridge penalty) \[ \textrm{Penalty}_{L2}(\beta, \lambda) = \lambda \sum_{j=1}^p \left|\beta_j \right|^2 = \beta^T\beta \]

  • L1-regularization (lasso penalty) \[ \textrm{Penalty}_{L1}(\beta, \lambda) = \lambda \sum_{j=1}^p \left|\beta_j \right| \]

Both approaches differ in how they penalize the size of the coefficients. The L2-regularization penalizes the sum of the squared coefficients, whereas the L1-regularization penalizes the sum of the absolute values of the coefficients. The added penalty causes the coefficients to shrink towards zero. The amount of shrinkage is controlled by the regularization parameter \(\lambda\). The larger \(\lambda\), the more the coefficients are shrunk towards zero. Figure 15.6 demonstrates the difference between the two regularization approaches. In L2 regularization, the coefficients are shrunk towards zero, but they are may never be exactly zero. In L1 regularization, the coefficients can be exactly zero. This means that L1 regularization can be used for feature selection.
Contours of the error and constraint functions for lasso (left) and ridge (right) regularization. The green areas are the constraint regions, $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$, while the blue ellipses are the contours of the RSS (residual sum of squares). The optimal coefficients are the points where the ellipses touch the constraint regions (red).

Figure 15.6: Contours of the error and constraint functions for lasso (left) and ridge (right) regularization. The green areas are the constraint regions, \(|\beta_1| + |\beta_2| \leq t\) and \(\beta_1^2 + \beta_2^2 \leq t^2\), while the blue ellipses are the contours of the RSS (residual sum of squares). The optimal coefficients are the points where the ellipses touch the constraint regions (red).

In some cases, it is useful to combine the L1 and L2 regularization. This is called elasticnet regularization.11 The elasticnet penalty is a weighted combination of the L1 and L2 penalties. The weight \(\alpha\) controls the relative contribution of the L1 and L2 penalties. When \(\alpha=0\), the elasticnet penalty is equivalent to the L2 penalty. When \(\alpha=1\), the elasticnet penalty is equivalent to the L1 penalty. \[ P_{elastic}(\beta) = \alpha P_{L1}(\beta, \lambda) + \frac12 (1-\alpha) P_{L2}P(\beta, \lambda) \]

The glmnet package implements both L1 and L2 regularization as well as the elasticnet penalty. We’ve already seen in Section 14.4 how we can explore the penalty and mixture hyperparameters with tidymodels. See Chapter 21 for more details.

15.3 Feature selection

Many text books cover methods like forward and backward stepwise selection and best subset selection. The authors of tidymodels do not recommended these approaches for feature selection. It’s unlikely that they will add these methods to tidymodels in the future. Instead, they suggest using regularization, in particular L1 regularization, or other methods. In fact, they work on a new package called colino that implements several approaches to feature selection.

The colino package is still in development and is not yet available on CRAN. You can find the development version on GitHub and install it with the following code.

Code
if (!require(colino)) {
    devtools::install_github("stevenpawley/colino", force=TRUE)
}
library(colino)

The methods are all implemented as steps compatible with the recipe package.

Todo:

Look through the package documentation at https://stevenpawley.github.io/colino/ to get an overview of the various methods.

The following code shows how to use the step_select_forests() function to select features based on variable importance derived from a random forest model. The actual model will be a \(k\)-nearest neighbor model.

Code
set.seed(123)
resamples <- vfold_cv(mtcars)
rec_select <- recipe(mpg ~ ., data=mtcars) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_select_forests(all_predictors(), outcome="mpg", top_p=tune())
model <- nearest_neighbor(mode="regression", neighbors=tune())
mtcars_workflow <- workflow() %>%
    add_recipe(rec_select) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
name id source component component_id object
neighbors neighbors model_spec nearest_neighbor main integer , 1 , 15 , TRUE , TRUE , # Nearest Neighbors
top_p top_p recipe step_select_forests select_forests_iOjwA integer, 1, 4, TRUE, TRUE, # Selected Predictors, function (object, x, log_vals = FALSE, …) , {, check_param(object), rngs <- range_get(object, original = FALSE), if (!is_unknown(rngs\(upper)) {, return(object), }, x_dims <- dim(x), if (is.null(x_dims)) {, rlang::abort("Cannot determine number of columns. Is `x` a 2D data object?"), }, if (log_vals) {, rngs[2] <- log10(x_dims[2]), }, else {, rngs[2] <- x_dims[2], }, if (object\)type == “integer” & is.null(object$trans)) {, rngs <- as.integer(rngs), }, range_set(object, rngs), }

The step_select_forests() function has various tunable parameters. Here, we use top_p, the number of most important features to keep.

Now we have everything we need to tune the workflow.12

Code
tune_results <- tune_grid(mtcars_workflow,
                          resamples=resamples,
                          grid=grid_random(parameters, size=50))
tune_results %>%
    show_best(metric="rmse") %>%
    select(-.config) %>%
    knitr::kable()
neighbors top_p .metric .estimator mean n std_err
6 4 rmse standard 2.341911 10 0.2005014
8 4 rmse standard 2.494542 10 0.2542292
1 4 rmse standard 2.515561 10 0.2915867
9 4 rmse standard 2.565404 10 0.2747478
3 3 rmse standard 2.588270 10 0.3163073

The best model selected 4 features and 6 for the number of neighbors in the \(k\)-NN model. We can use the finalize_workflow() function to fit the best model to the entire data set and visualize the results, see Figure 15.7.

Code
best_parameters <- select_best(tune_results, metric="rmse")
best_workflow <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(mtcars)
Code
mtcars %>%
    bind_cols(
        predict(best_workflow, new_data=mtcars)
    ) %>%
    ggplot(aes(x=mpg, y=.pred)) +
        geom_point() +
        geom_abline() +
        xlim(10, 35) + ylim(10, 35) +
        labs(x="Observed mpg", y="Predicted mpg")
Feature selection using random forest variable importance

Figure 15.7: Feature selection using random forest variable importance

We can also extract the recipe and look at the results from the feature importance.

Code
best_workflow %>%
    extract_recipe() %>%
    tidy(number = 2, type = "scores")
## # A tibble: 10 × 3
##    variable  score id                  
##    <chr>     <dbl> <chr>               
##  1 wt       100    select_forests_iOjwA
##  2 disp      80.3  select_forests_iOjwA
##  3 cyl       74.8  select_forests_iOjwA
##  4 hp        63.8  select_forests_iOjwA
##  5 carb      12.8  select_forests_iOjwA
##  6 drat       9.67 select_forests_iOjwA
##  7 vs         4.65 select_forests_iOjwA
##  8 qsec       4.37 select_forests_iOjwA
##  9 am         3.50 select_forests_iOjwA
## 10 gear       0    select_forests_iOjwA

The top-4 features used in the final model are wt, disp, cyl, hp.

15.4 Hyperparameter tuning

We already covered hyperparameter tuning in Chapter 14, so what else could there be said? Let’s revisit the example from Chapter 13. In that chapter, we compared the performance of a \(k\)-nearest neighbor model and a logistic regression model to predict the probability of a customer taking a loan. Comparing the ROC curves showed strange behavior of the \(k\)-NN model. It’s likely that our model requires further tuning.

Let’s start with loading and preprocessing the data.

Code
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/UniversalBank.csv.gz")
data <- data %>%
    select(-c(ID, `ZIP Code`)) %>%
    rename(
        Personal.Loan = `Personal Loan`,
        Securities.Account = `Securities Account`,
        CD.Account = `CD Account`
    ) %>%
    mutate(
        Personal.Loan = factor(Personal.Loan, labels=c("Yes", "No"), levels=c(1, 0)),
        Education = factor(Education, labels=c("Undergrad", "Graduate", "Advanced")),
    )

Next we repeat the model training from Chapter 13. We use 10-fold crossvalidation to estimate the performance of the models.

Code
set.seed(1353)
folds <- vfold_cv(data, strata=Personal.Loan)

formula <- Personal.Loan ~ Age + Experience + Income + Family + CCAvg + Education +
                           Mortgage + Securities.Account + CD.Account + Online +
                           CreditCard

# Train the logistic regression model
logreg_wf <- workflow() %>%
    add_model(logistic_reg() %>% set_engine("glm")) %>%
    add_formula(formula)
logreg_cv <- logreg_wf %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
logreg_cv_predictions <- collect_predictions(logreg_cv)

# Train the nearest-neighbor model with 5 neighbors
nn5_wf <- workflow() %>%
    add_model(nearest_neighbor(neighbors=5) %>%
              set_mode("classification") %>%
              set_engine("kknn")) %>%
    add_formula(formula)
nn5_cv <- nn5_wf %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn5_cv_predictions <- collect_predictions(nn5_cv)

Figure 15.10 shows the ROC curves for the logistic regression and the 5-NN model. The logistic regression model performs better than the 5-NN model. The 5-NN model has a strange shape, which is likely due to the fact that we did not tune the model.

Let’s tune the model with the default settings

Code
# Tune the number of neighbors of the nearest-neighbor model
nn_wf <- workflow() %>%
    add_model(nearest_neighbor(neighbors=tune()) %>%
              set_mode("classification") %>%
              set_engine("kknn")) %>%
    add_formula(formula)
nn_default_tune <- tune_grid(nn_wf, resamples=folds)
autoplot(nn_default_tune)
Tuning results for the $k$-NN model with default settings

Figure 15.8: Tuning results for the \(k\)-NN model with default settings

Figure 15.8 shows the results of the tuning process. There are two interesting observations to point out. First, the selected number of neighbors depends on the metric. If we use accuracy, the optimal number of neighbors is 5, the same value we used in our initial model. For ROC AUC, the optimal number of neighbors is 14.

The second observation is that the roc_auc curve has not reached a maximum. This means that we have not explored the entire parameter space. We can increase the number of neighbors to explore the parameter space further.

Code
parameters <- extract_parameter_set_dials(nn_wf) %>%
    update(neighbors = neighbors(c(1, 100)))

nn_bayes_tune <- tune_bayes(nn_wf, resamples=folds, param_info=parameters, iter=25)
## ! No improvement for 10 iterations; returning current results.
Code
autoplot(nn_bayes_tune)
Tuning results for the $k$-NN model with a larger number of neighbors

Figure 15.9: Tuning results for the \(k\)-NN model with a larger number of neighbors

Code
optimal_nn_roc <- nn_bayes_tune %>% select_best(metric="roc_auc")

This time, the hyperparameter search identified the optimal number of neighbors as 48.

Code
cv_roc <- logreg_cv_predictions %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn5_roc <- nn5_cv_predictions %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

nn_default_auc <- nn_wf %>%
    finalize_workflow(select_best(nn_default_tune, metric="roc_auc")) %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_default_auc_roc <- nn_default_auc %>%
    collect_predictions() %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

nn_bayes_auc <- nn_wf %>%
    finalize_workflow(select_best(nn_bayes_tune, metric="roc_auc")) %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_bayes_auc_roc <- nn_bayes_auc %>%
    collect_predictions() %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

g1 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn5_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(a) Initial model (accuracy, k=5)")

g2 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn_default_auc_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(b) Default model (AUC, k=14)")

g3 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn_bayes_auc_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(c) Optimal model (AUC, k=48)")

g1 + g2 + g3
ROC curves for the logistic regression and the 5-NN model (left) and the tuned NN model (right)

Figure 15.10: ROC curves for the logistic regression and the 5-NN model (left) and the tuned NN model (right)

Comparing the three ROC curves, we see that the fully tuned model, with sufficient exploration of the hyperparameter space, performs better than the initial model across the full data range.

Let’s summarize the data:

Code
logreg_metrics <- collect_metrics(logreg_cv)
nn5_metrics <- collect_metrics(nn5_cv)
nn_default_metrics <- collect_metrics(nn_default_auc)
nn_bayes_metrics <- collect_metrics(nn_bayes_auc)

df <- bind_rows(
    logreg_metrics %>% mutate(model="Logistic regression"),
    nn5_metrics %>% mutate(model="Nearest neighbor (k=5)"),
    nn_default_metrics %>% mutate(model="Nearest neighbor (k=14)"),
    nn_bayes_metrics %>% mutate(model="Nearest neighbor (k=48)"),
)
df %>%
    select(model, mean, .metric) %>%
    pivot_wider(names_from=.metric, values_from=mean) %>%
    knitr::kable(digits = 3) %>%
    kableExtra::kable_styling(full_width=FALSE)
model accuracy brier_class roc_auc
Logistic regression 0.958 0.032 0.961
Nearest neighbor (k=5) 0.965 0.028 0.938
Nearest neighbor (k=14) 0.960 0.029 0.966
Nearest neighbor (k=48) 0.947 0.037 0.978
Code
df %>%
    mutate(
        model = factor(model,
            levels=rev(c("Logistic regression", "Nearest neighbor (k=5)",
                        "Nearest neighbor (k=14)", "Nearest neighbor (k=48)")))
    ) %>%
ggplot(aes(x=mean, y=model)) +
    geom_point() +
    geom_pointrange(aes(xmin=mean - std_err, xmax=mean + std_err)) +
    facet_wrap(~ .metric)
Comparison of the performance metrics for the different models

Figure 15.11: Comparison of the performance metrics for the different models

Tuning the hyperparameters of the \(k\)-NN model improved the ROC AUC to larger value than the logistic regression. The accuracy on the other hand dropped. However, as we learned in Chapter 11, accuracy is a performance metric that depends on the threshold used for classification. Using the methods described in Section 11.1.2, we can tweak the thresholds of the logistic regression and the best \(k\)-NN model. In the following example, we use model %>% collect_predictions() to use the out-of-fold predictions from cross-validation with the probably::threshold_perf method.

Code
threshold_graph <- function(model) {
    performance <- probably::threshold_perf(
        model %>% collect_predictions(),
        Personal.Loan, .pred_Yes,
        thresholds=seq(0.05, 0.9, 0.01), event_level="first",
        metrics=metric_set(accuracy, kap, bal_accuracy, f_meas)
    )
    max_values <- performance %>%
        arrange(desc(.threshold)) %>%
        group_by(.metric) %>%
        filter(.estimate == max(.estimate)) %>%
        filter(row_number() == 1)
    g <- ggplot(performance, aes(x=.threshold, y=.estimate, color=.metric)) +
            geom_line() +
            geom_vline(data=max_values, aes(xintercept=.threshold, color=.metric)) +
            scale_x_continuous(breaks=seq(0, 1, 0.1)) +
            coord_cartesian(ylim=c(0.5, 1)) +
            xlab("Threshold") + ylab("Metric value") +
            theme(legend.position="inside", legend.position.inside = c(0.85, 0.75))
    return(g)
}
g1 <- threshold_graph(logreg_cv) + labs(title="Logistic regression model")
g2 <- threshold_graph(nn_bayes_auc) + labs(title="k-nearest neighbor model")

g1 + g2
Metric values for different thresholds of the logistic regression and the $k$-NN model

Figure 15.12: Metric values for different thresholds of the logistic regression and the \(k\)-NN model

Figure 15.12 shows the performance of the logistic regression and the \(k\)-NN model for different thresholds. For all metrics, the threshold dependent maxima are higher for \(k\)-NN compared to logistic regression.

15.5 The one-standard-error rule

Breiman et al (Breiman et al. 1984) suggested that instead of selecting the best model, we should select the simplest model within one standard error of the best model. This is called the one-standard-error rule. The idea is that the best model is likely to be overfitted and that a simpler model is likely to generalize better.

Let’s demonstrate this with a ridge regression model to predict mpg in the mtcars dataset. First tune the model with a suitable range for the penalty parameter.

Code
set.seed(123)
resamples <- vfold_cv(mtcars)
rec <- recipe(mpg ~ ., data=mtcars)
spec <- linear_reg(mode="regression", penalty=tune(), mixture=0) %>%
    set_engine("glmnet")
wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(spec)
parameters <- extract_parameter_set_dials(wf) %>%
    update(penalty=penalty(c(-2, 2)))
tune_results <- tune_grid(wf,
                          resamples=resamples,
                          grid=grid_regular(parameters, levels=50))

The one-standard-error rule is implemented in the select_by_one_std_err() function. It is used in a similar way to the select_best function.

Code
penalty_best <- select_best(tune_results, metric="rmse")
penalty_1se <- select_by_one_std_err(tune_results, desc(penalty), metric="rmse")

The implementation doesn’t know what a simpler model is, so we need to tell it. In this case, we use the penalty parameter as a proxy for model complexity and inform the model that the complexity decreases with the penalty value.

Figure 15.13 shows the results. The best model is the one with the lowest RMSE. The one-standard-error rule adds one standard error to the minimum and then moves horizontally to the right until it crosses the curve. The penalty value to the left of the crossing is the selected parameter.

Code
penalty_best <- show_best(tune_results, metric="rmse", n=1)
tune_results %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
ggplot(aes(x=penalty, y=mean, ymax=mean + std_err, ymin=mean - std_err)) +
    geom_line() +
    geom_errorbar(color="grey") +
    geom_point() +
    scale_x_log10() +
    coord_cartesian(ylim=c(2, 4)) +
    geom_vline(xintercept=penalty_best$penalty[1], color="blue") +
    geom_vline(xintercept=penalty_1se$penalty[1], color="red") +
    geom_hline(yintercept=penalty_best$mean[1] + penalty_best$std_err[1], color="blue")
Tuning results for ridge regression model to predict mpg with the one-standard-error rule selection of penalty; the _best_ penalty value is indicated by the blue line and the _one-standard-error_ penalty value is indicated by the red line; the horizontal blue line demonstrates the rule.

Figure 15.13: Tuning results for ridge regression model to predict mpg with the one-standard-error rule selection of penalty; the best penalty value is indicated by the blue line and the one-standard-error penalty value is indicated by the red line; the horizontal blue line demonstrates the rule.

Useful to know:

In this section, we learned that it is important to:

  • sufficiently explore the hyperparameter space
  • use the right metric for model selection (ROC)
  • select the threshold after you identified the best model

Further information:

Code
stopCluster(cl)
registerDoSEQ()

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_tune.png")
library(tidymodels)
library(tidyverse)
library(patchwork)
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)
# tidymodels is very picky about data types and will complain when we
# predict on new data if the age value is not an integer. We therefore
# convert here age to double
data <- ISLR2::Wage %>%
    mutate(age = as.double(age))

recipe(wage ~ age, data=data) %>%
    step_poly(hp) %>%
    step_discretize(hp) %>%
    step_cut(hp, breaks=tune()) %>%
    step_bs(hp) %>%
    tunable()
set.seed(123)
poly_recipe <- recipe(wage ~ age, data=data) %>%
    step_poly(age, degree=tune())
model <- linear_reg(mode="regression") %>%
    set_engine("glm")
poly_workflow <- workflow() %>%
    add_recipe(poly_recipe) %>%
    add_model(model)
tune_results <- tune_grid(poly_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- poly_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)
df <- tibble(age = seq(min(data$age), max(data$age), length.out = 100))
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
set.seed(123)
step_recipe <- recipe(wage ~ age, data=data) %>%
    step_discretize(age, num_breaks=tune())
step_workflow <- workflow() %>%
    add_recipe(step_recipe) %>%
    add_model(model)
tune_results <- tune_grid(step_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_breaks <- (tune_results %>% show_best(metric="rmse"))$num_breaks[1]
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- step_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)
best_breaks <- (tune_results %>% show_best(metric="rmse"))$num_breaks[1]
cuts <- tibble(breaks=quantile(data$age, probs=seq(0, 1, by=1 / best_breaks)))
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_vline(aes(xintercept=breaks), data=cuts, color="darkgreen", alpha=0.5) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
step_recipe <- recipe(wage ~ age, data=data) %>%
    step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)
step_workflow <- workflow() %>%
    add_recipe(step_recipe) %>%
    add_model(model)
trained <- step_workflow %>% fit(data)
cuts <- tibble(breaks=seq(20, 70, by=10))
df %>%
    bind_cols(
        predict(trained, new_data=df),
        predict(trained, new_data=df, type="conf_int")
    ) %>%
ggplot(aes(x=age, y=.pred)) +
    geom_vline(aes(xintercept=breaks), data=cuts, color="darkgreen", alpha=0.5) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")
set.seed(123)
spline_recipe <- recipe(wage ~ age, data=data) %>%
    step_bs(age, degree=tune(), deg_free=tune())
spline_workflow <- workflow() %>%
    add_recipe(spline_recipe) %>%
    add_model(model)
tune_results <- tune_grid(spline_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- spline_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)
df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
    ggplot(aes(x=age, y=.pred)) +
      geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
      geom_line() +
      geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
      labs(x="Age", y="Wage")
knitr::include_graphics("images/regularization.png")
if (!require(colino)) {
    devtools::install_github("stevenpawley/colino", force=TRUE)
}
library(colino)
set.seed(123)
resamples <- vfold_cv(mtcars)
rec_select <- recipe(mpg ~ ., data=mtcars) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_select_forests(all_predictors(), outcome="mpg", top_p=tune())
model <- nearest_neighbor(mode="regression", neighbors=tune())
mtcars_workflow <- workflow() %>%
    add_recipe(rec_select) %>%
    add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
tune_results <- tune_grid(mtcars_workflow,
                          resamples=resamples,
                          grid=grid_random(parameters, size=50))
tune_results %>%
    show_best(metric="rmse") %>%
    select(-.config) %>%
    knitr::kable()
nfeatures <- (tune_results %>% show_best(metric="rmse"))$top_p[1]
nneighbors <- (tune_results %>% show_best(metric="rmse"))$neighbors[1]
best_parameters <- select_best(tune_results, metric="rmse")
best_workflow <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(mtcars)
feature_scores <- best_workflow %>%
    extract_recipe() %>%
    tidy(number = 2, type = "scores")
kept_features <- feature_scores$variable[1:nfeatures]
mtcars %>%
    bind_cols(
        predict(best_workflow, new_data=mtcars)
    ) %>%
    ggplot(aes(x=mpg, y=.pred)) +
        geom_point() +
        geom_abline() +
        xlim(10, 35) + ylim(10, 35) +
        labs(x="Observed mpg", y="Predicted mpg")
best_workflow %>%
    extract_recipe() %>%
    tidy(number = 2, type = "scores")
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/UniversalBank.csv.gz")
data <- data %>%
    select(-c(ID, `ZIP Code`)) %>%
    rename(
        Personal.Loan = `Personal Loan`,
        Securities.Account = `Securities Account`,
        CD.Account = `CD Account`
    ) %>%
    mutate(
        Personal.Loan = factor(Personal.Loan, labels=c("Yes", "No"), levels=c(1, 0)),
        Education = factor(Education, labels=c("Undergrad", "Graduate", "Advanced")),
    )
set.seed(1353)
folds <- vfold_cv(data, strata=Personal.Loan)

formula <- Personal.Loan ~ Age + Experience + Income + Family + CCAvg + Education +
                           Mortgage + Securities.Account + CD.Account + Online +
                           CreditCard

# Train the logistic regression model
logreg_wf <- workflow() %>%
    add_model(logistic_reg() %>% set_engine("glm")) %>%
    add_formula(formula)
logreg_cv <- logreg_wf %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
logreg_cv_predictions <- collect_predictions(logreg_cv)

# Train the nearest-neighbor model with 5 neighbors
nn5_wf <- workflow() %>%
    add_model(nearest_neighbor(neighbors=5) %>%
              set_mode("classification") %>%
              set_engine("kknn")) %>%
    add_formula(formula)
nn5_cv <- nn5_wf %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn5_cv_predictions <- collect_predictions(nn5_cv)
# Tune the number of neighbors of the nearest-neighbor model
nn_wf <- workflow() %>%
    add_model(nearest_neighbor(neighbors=tune()) %>%
              set_mode("classification") %>%
              set_engine("kknn")) %>%
    add_formula(formula)
nn_default_tune <- tune_grid(nn_wf, resamples=folds)
autoplot(nn_default_tune)
best_nn_roc <- nn_default_tune %>%
    select_best(metric="roc_auc")
best_nn_accuracy <- nn_default_tune %>%
    select_best(metric="accuracy")
parameters <- extract_parameter_set_dials(nn_wf) %>%
    update(neighbors = neighbors(c(1, 100)))

nn_bayes_tune <- tune_bayes(nn_wf, resamples=folds, param_info=parameters, iter=25)
autoplot(nn_bayes_tune)
optimal_nn_roc <- nn_bayes_tune %>% select_best(metric="roc_auc")
cv_roc <- logreg_cv_predictions %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn5_roc <- nn5_cv_predictions %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

nn_default_auc <- nn_wf %>%
    finalize_workflow(select_best(nn_default_tune, metric="roc_auc")) %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_default_auc_roc <- nn_default_auc %>%
    collect_predictions() %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

nn_bayes_auc <- nn_wf %>%
    finalize_workflow(select_best(nn_bayes_tune, metric="roc_auc")) %>%
    fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_bayes_auc_roc <- nn_bayes_auc %>%
    collect_predictions() %>%
    roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")

g1 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn5_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(a) Initial model (accuracy, k=5)")

g2 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn_default_auc_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(b) Default model (AUC, k=14)")

g3 <- ggplot() +
    geom_path(data=cv_roc, aes(x=1 - specificity, y=sensitivity), color="gray") +
    geom_path(data=nn_bayes_auc_roc, aes(x=1 - specificity, y=sensitivity)) +
    geom_abline(lty=2) +
    labs(title="(c) Optimal model (AUC, k=48)")

g1 + g2 + g3
logreg_metrics <- collect_metrics(logreg_cv)
nn5_metrics <- collect_metrics(nn5_cv)
nn_default_metrics <- collect_metrics(nn_default_auc)
nn_bayes_metrics <- collect_metrics(nn_bayes_auc)

df <- bind_rows(
    logreg_metrics %>% mutate(model="Logistic regression"),
    nn5_metrics %>% mutate(model="Nearest neighbor (k=5)"),
    nn_default_metrics %>% mutate(model="Nearest neighbor (k=14)"),
    nn_bayes_metrics %>% mutate(model="Nearest neighbor (k=48)"),
)
df %>%
    select(model, mean, .metric) %>%
    pivot_wider(names_from=.metric, values_from=mean) %>%
    knitr::kable(digits = 3) %>%
    kableExtra::kable_styling(full_width=FALSE)
df %>%
    mutate(
        model = factor(model,
            levels=rev(c("Logistic regression", "Nearest neighbor (k=5)",
                        "Nearest neighbor (k=14)", "Nearest neighbor (k=48)")))
    ) %>%
ggplot(aes(x=mean, y=model)) +
    geom_point() +
    geom_pointrange(aes(xmin=mean - std_err, xmax=mean + std_err)) +
    facet_wrap(~ .metric)
threshold_graph <- function(model) {
    performance <- probably::threshold_perf(
        model %>% collect_predictions(),
        Personal.Loan, .pred_Yes,
        thresholds=seq(0.05, 0.9, 0.01), event_level="first",
        metrics=metric_set(accuracy, kap, bal_accuracy, f_meas)
    )
    max_values <- performance %>%
        arrange(desc(.threshold)) %>%
        group_by(.metric) %>%
        filter(.estimate == max(.estimate)) %>%
        filter(row_number() == 1)
    g <- ggplot(performance, aes(x=.threshold, y=.estimate, color=.metric)) +
            geom_line() +
            geom_vline(data=max_values, aes(xintercept=.threshold, color=.metric)) +
            scale_x_continuous(breaks=seq(0, 1, 0.1)) +
            coord_cartesian(ylim=c(0.5, 1)) +
            xlab("Threshold") + ylab("Metric value") +
            theme(legend.position="inside", legend.position.inside = c(0.85, 0.75))
    return(g)
}
g1 <- threshold_graph(logreg_cv) + labs(title="Logistic regression model")
g2 <- threshold_graph(nn_bayes_auc) + labs(title="k-nearest neighbor model")

g1 + g2
set.seed(123)
resamples <- vfold_cv(mtcars)
rec <- recipe(mpg ~ ., data=mtcars)
spec <- linear_reg(mode="regression", penalty=tune(), mixture=0) %>%
    set_engine("glmnet")
wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(spec)
parameters <- extract_parameter_set_dials(wf) %>%
    update(penalty=penalty(c(-2, 2)))
tune_results <- tune_grid(wf,
                          resamples=resamples,
                          grid=grid_regular(parameters, levels=50))
penalty_best <- select_best(tune_results, metric="rmse")
penalty_1se <- select_by_one_std_err(tune_results, desc(penalty), metric="rmse")
penalty_best <- show_best(tune_results, metric="rmse", n=1)
tune_results %>%
    collect_metrics() %>%
    filter(.metric == "rmse") %>%
ggplot(aes(x=penalty, y=mean, ymax=mean + std_err, ymin=mean - std_err)) +
    geom_line() +
    geom_errorbar(color="grey") +
    geom_point() +
    scale_x_log10() +
    coord_cartesian(ylim=c(2, 4)) +
    geom_vline(xintercept=penalty_best$penalty[1], color="blue") +
    geom_vline(xintercept=penalty_1se$penalty[1], color="red") +
    geom_hline(yintercept=penalty_best$mean[1] + penalty_best$std_err[1], color="blue")
stopCluster(cl)
registerDoSEQ()

References

Breiman, Leo, Jerome H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Chapman & Hall.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.

  1. You can find different definitions of elasticnet regularization in the literature. Here, we use the definition from the glmnet package.↩︎

  2. I initially ran this code using size=10 for the random grid search. There was a large gap between the performance of the best model and the other models. This is an indication that the search was not exhaustive enough.↩︎