Chapter 15 Model tuning - examples
In the previous chapters, we learned how to define and tune hyperparameters using the tidymodels packages dials and tune. In this chapter, we will go deeper into specific areas of model tuning. Figure 15.1 shows how this step fits into the overall model workflow.
Load the packages we need for this chapter.
Because tuning requires training many models, we also enable parallel computing.
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
15.1 Feature engineering
We already saw in Chapter 14 that preprocessing steps have tunable parameters. The tunable() function can be used to identify the tunable parameters in a recipe. For example, the following code identifies the tunable parameters in a recipe that combines a polynomial step with several other preprocessing steps.
Code
# tidymodels is picky about data types: age is stored as an integer in
# ISLR2::Wage, and predict() will complain if new data contains age as
# a double. We therefore convert age to double here.
data <- ISLR2::Wage %>%
mutate(age = as.double(age))
recipe(wage ~ age, data=data) %>%
step_poly(age) %>%
step_discretize(age) %>%
step_cut(age, breaks=tune()) %>%
step_bs(age) %>%
tunable()
## # A tibble: 5 × 5
## name call_info source component component_id
## <chr> <list> <chr> <chr> <chr>
## 1 degree <named list [2]> recipe step_poly poly_sj9cB
## 2 min_unique <named list [2]> recipe step_discretize discretize_7r5UK
## 3 num_breaks <named list [2]> recipe step_discretize discretize_7r5UK
## 4 deg_free <named list [3]> recipe step_bs bs_VYevV
## 5 degree <named list [3]> recipe step_bs bs_VYevV
We see that the polynomial step has a tunable parameter degree. Note that step_cut() is missing from the output: even though we marked breaks with tune(), it is not a tunable parameter (more on this in Section 15.1.2).
15.1.1 Polynomial regression
With this information, we can tune a polynomial regression model.
Code
set.seed(123)
poly_recipe <- recipe(wage ~ age, data=data) %>%
step_poly(age, degree=tune())
model <- linear_reg(mode="regression") %>%
set_engine("glm")
poly_workflow <- workflow() %>%
add_recipe(poly_recipe) %>%
add_model(model)
tune_results <- tune_grid(poly_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
## # A tibble: 3 × 7
## degree .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 rmse standard 39.7 10 1.30 Preprocessor3_Model1
## 2 2 rmse standard 39.8 10 1.30 Preprocessor1_Model1
## 3 1 rmse standard 40.8 10 1.28 Preprocessor2_Model1
By default, this will tune the polynomial degree from 1 to 3. The best model uses a cubic polynomial, with the quadratic model a close second. We can use the finalize_workflow() function to fit the best model to the entire data set.
Code
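best_parameters <- select_best(tune_results, metric="rmse")
final_model <- poly_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)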
Figure 15.2 shows the best fit of the polynomial regression model. The shaded area shows the 95% confidence interval of the fit. The polynomial regression model is a linear model, but the polynomial step allows us to fit a non-linear relationship between the predictor and the outcome.
Code
df <- tibble(age = seq(min(data$age), max(data$age), length.out = 100))
df %>%
bind_cols(
predict(final_model, new_data=df),
predict(final_model, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
15.1.2 Step function regression
In this section, we develop a model using a step function. The step_discretize() function converts the age feature into bins. The tunable parameter is num_breaks.
Code
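set.seed(123)
step_recipe <- recipe(wage ~ age, data=data) %>%
    step_discretize(age, num_breaks=tune())
step_workflow <- workflow() %>%
    add_recipe(step_recipe) %>%
    add_model(model)
tune_results <- tune_grid(step_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")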
## # A tibble: 5 × 7
## num_breaks .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 rmse standard 40.3 10 1.27 Preprocessor6_Model1
## 2 4 rmse standard 40.3 10 1.25 Preprocessor2_Model1
## 3 7 rmse standard 40.8 10 1.28 Preprocessor1_Model1
## 4 6 rmse standard 40.8 10 1.28 Preprocessor3_Model1
## 5 8 rmse standard 40.8 10 1.28 Preprocessor4_Model1
Tuning determines that the best model uses 5 bins. We can use the finalize_workflow() function to fit the best model to the entire data set.
Code
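best_parameters <- select_best(tune_results, metric="rmse")
final_model <- step_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)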
Figure 15.3 shows the best fit of the regression model using a step function.
Code
best_breaks <- (tune_results %>% show_best(metric="rmse"))$num_breaks[1]
cuts <- tibble(breaks=quantile(data$age, probs=seq(0, 1, by=1/best_breaks)))
df %>%
bind_cols(
predict(final_model, new_data=df),
predict(final_model, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_vline(aes(xintercept=breaks), data=cuts, color='darkgreen', alpha=0.5) +
geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
The model developed in this section is different from the step function model described in Sections 7.2 and 7.8.1 of An Introduction to Statistical Learning (James et al. 2021). In the book, the authors use cut(age, 4) to create a categorical variable with four levels. This leads to different boundaries than those defined by the quantiles that the step_discretize() function uses. If you want to use predefined boundaries, e.g., age groups of 10 years, you can use
step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)
However, the breaks are not tunable in this case. Figure 15.4 shows the results of this model.
Code
step_recipe <- recipe(wage ~ age, data=data) %>%
step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)
step_workflow <- workflow() %>%
add_recipe(step_recipe) %>%
add_model(model)
trained <- step_workflow %>% fit(data)
cuts <- tibble(breaks=seq(20, 70, by=10))
df %>%
bind_cols(
predict(trained, new_data=df),
predict(trained, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_vline(aes(xintercept=breaks), data=cuts, color='darkgreen', alpha=0.5) +
geom_point(aes(x=age, y=wage),data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
15.1.3 Spline regression
Next we will tune a model using a spline representation for age.
Code
set.seed(123)
spline_recipe <- recipe(wage ~ age, data=data) %>%
step_bs(age, degree=tune(), deg_free=tune())
spline_workflow <- workflow() %>%
add_recipe(spline_recipe) %>%
add_model(model)
tune_results <- tune_grid(spline_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
## # A tibble: 5 × 8
## deg_free degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 2 rmse standard 39.7 10 1.30 Preprocessor6_Model1
## 2 8 1 rmse standard 39.7 10 1.28 Preprocessor8_Model1
## 3 4 2 rmse standard 39.7 10 1.30 Preprocessor1_Model1
## 4 14 1 rmse standard 39.7 10 1.27 Preprocessor7_Model1
## 5 10 2 rmse standard 39.8 10 1.29 Preprocessor4_Model1
Code
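best_parameters <- select_best(tune_results, metric="rmse")
final_model <- spline_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(data)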
Figure 15.5 shows the best fit of the spline regression model.
Code
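df %>%
    bind_cols(
        predict(final_model, new_data=df),
        predict(final_model, new_data=df, type="conf_int")
    ) %>%
    ggplot(aes(x=age, y=.pred)) +
    geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
    geom_line() +
    geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
    labs(x="Age", y="Wage")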
Other examples of preprocessing steps that can be tuned for feature engineering are:
- step_pca using the num_comp parameter (see the sketch below)
- step_nzv to remove variables that are highly sparse and unbalanced, using the freq_cut or unique_cut parameters
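Marking the number of retained components in step_pca() for tuning follows the same pattern as the steps above. Here is a minimal sketch using the mtcars data, assuming we normalize the numeric predictors first.
Code
pca_recipe <- recipe(mpg ~ ., data=mtcars) %>%
    # PCA is sensitive to scale, so normalize the predictors first
    step_normalize(all_numeric_predictors()) %>%
    # num_comp controls how many principal components are retained
    step_pca(all_numeric_predictors(), num_comp=tune())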
15.2 Regularization
Regularization is used in several machine learning methods. In general, it is any approach that controls the complexity of a model. For linear models, there are two common types of regularization. Both control the complexity of the model by penalizing the size of the coefficients.
L2-regularization (ridge penalty) \[ \textrm{Penalty}_{L2}(\beta, \lambda) = \lambda \sum_{j=1}^p \beta_j^2 = \lambda\, \beta^T\beta \]
L1-regularization (lasso penalty) \[ \textrm{Penalty}_{L1}(\beta, \lambda) = \lambda \sum_{j=1}^p \left|\beta_j \right| \]
In some cases, it is useful to combine the L1 and L2 regularization. This is called elasticnet regularization.11 The elasticnet penalty is a weighted combination of the L1 and L2 penalties. The weight \(\alpha\) controls the relative contribution of the L1 and L2 penalties. When \(\alpha=0\), the elasticnet penalty is equivalent to the L2 penalty; when \(\alpha=1\), it is equivalent to the L1 penalty. \[ \textrm{Penalty}_{elastic}(\beta, \lambda) = \alpha\, \textrm{Penalty}_{L1}(\beta, \lambda) + \frac{1}{2} (1-\alpha)\, \textrm{Penalty}_{L2}(\beta, \lambda) \]
The glmnet package implements both L1 and L2 regularization as well as the elasticnet penalty. We've already seen in Section 14.4 how we can explore the penalty and mixture hyperparameters with tidymodels. See Chapter 21 for more details.
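As a reminder, a model specification that marks both hyperparameters for tuning can be written as in the following sketch; in parsnip, penalty corresponds to \(\lambda\) and mixture to \(\alpha\).
Code
elastic_spec <- linear_reg(penalty=tune(), mixture=tune()) %>%
    set_engine("glmnet")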
15.3 Feature selection
Many textbooks cover methods like forward and backward stepwise selection and best subset selection. The authors of tidymodels do not recommend these approaches for feature selection, and it is unlikely that they will add them to tidymodels in the future. Instead, they suggest using regularization, in particular L1 regularization, or other methods. In fact, they are working on a new package called colino that implements several approaches to feature selection.
The colino package is still in development and is not yet available on CRAN. You can find the development version on GitHub and install it with the following code.
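Code
if (!require(colino)) {
    devtools::install_github("stevenpawley/colino", force=TRUE)
}
library(colino)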
The methods are all implemented as steps compatible with the recipes package.
Todo:
Look through the package documentation at https://stevenpawley.github.io/colino/ to get an overview of the various methods.
The following code shows how to use the step_select_forests() function to select features based on variable importance derived from a random forest model. The actual model will be a \(k\)-nearest neighbor model.
Code
set.seed(123)
resamples <- vfold_cv(mtcars)
rec_select <- recipe(mpg ~ ., data=mtcars) %>%
step_normalize(all_numeric_predictors()) %>%
step_select_forests(all_predictors(), outcome="mpg", top_p=tune())
model <- nearest_neighbor(mode="regression", neighbors=tune())
mtcars_workflow <- workflow() %>%
add_recipe(rec_select) %>%
add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
name | id | source | component | component_id | object |
---|---|---|---|---|---|
neighbors | neighbors | model_spec | nearest_neighbor | main | integer, 1 to 15 (# Nearest Neighbors) |
top_p | top_p | recipe | step_select_forests | select_forests_iOjwA | integer, 1 to 4 (# Selected Predictors) |
The step_select_forests() function has various tunable parameters. Here, we use top_p, the number of most important features to keep.
Now we have everything we need to tune the workflow.12
Code
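tune_results <- tune_grid(mtcars_workflow,
    resamples=resamples,
    grid=grid_random(parameters, size=50))
tune_results %>%
    show_best(metric='rmse') %>%
    select(-.config) %>%
    knitr::kable()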
neighbors | top_p | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|
6 | 4 | rmse | standard | 2.341911 | 10 | 0.2005014 |
8 | 4 | rmse | standard | 2.494542 | 10 | 0.2542292 |
1 | 4 | rmse | standard | 2.515561 | 10 | 0.2915867 |
9 | 4 | rmse | standard | 2.565404 | 10 | 0.2747478 |
3 | 3 | rmse | standard | 2.588270 | 10 | 0.3163073 |
The best model selected 4 features and used 6 neighbors in the \(k\)-NN model. We can use the finalize_workflow() function to fit the best model to the entire data set and visualize the results, see Figure 15.7.
Code
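nfeatures <- (tune_results %>% show_best(metric='rmse'))$top_p[1]
nneighbors <- (tune_results %>% show_best(metric='rmse'))$neighbors[1]
best_parameters <- select_best(tune_results, metric="rmse")
best_workflow <- mtcars_workflow %>%
    finalize_workflow(best_parameters) %>%
    fit(mtcars)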
Code
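mtcars %>%
    bind_cols(
        predict(best_workflow, new_data=mtcars)
    ) %>%
    ggplot(aes(x=mpg, y=.pred)) +
    geom_point() +
    geom_abline() +
    xlim(10, 35) + ylim(10, 35) +
    labs(x="Observed mpg", y="Predicted mpg")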
We can also extract the recipe and look at the feature importance scores.
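Code
best_workflow %>% extract_recipe() %>%
    tidy(number = 2, type = "scores")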
## # A tibble: 10 × 3
## variable score id
## <chr> <dbl> <chr>
## 1 wt 100 select_forests_iOjwA
## 2 disp 80.3 select_forests_iOjwA
## 3 cyl 74.8 select_forests_iOjwA
## 4 hp 63.8 select_forests_iOjwA
## 5 carb 12.8 select_forests_iOjwA
## 6 drat 9.67 select_forests_iOjwA
## 7 vs 4.65 select_forests_iOjwA
## 8 qsec 4.37 select_forests_iOjwA
## 9 am 3.50 select_forests_iOjwA
## 10 gear 0 select_forests_iOjwA
The top four features used in the final model are wt, disp, cyl, and hp.
15.4 Hyperparameter tuning
We already covered hyperparameter tuning in Chapter 14, so what else is there to say? Let's revisit the example from Chapter 13. In that chapter, we compared the performance of a \(k\)-nearest neighbor model and a logistic regression model to predict the probability of a customer taking a loan. Comparing the ROC curves showed strange behavior of the \(k\)-NN model; it is likely that our model requires further tuning.
Let’s start with loading and preprocessing the data.
Code
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/UniversalBank.csv.gz")
data <- data %>%
select(-c(ID, `ZIP Code`)) %>%
rename(
Personal.Loan = `Personal Loan`,
Securities.Account = `Securities Account`,
CD.Account = `CD Account`
) %>%
mutate(
Personal.Loan = factor(Personal.Loan, labels=c("Yes", "No"), levels=c(1, 0)),
Education = factor(Education, labels=c("Undergrad", "Graduate", "Advanced")),
)
Next we repeat the model training from Chapter 13. We use 10-fold cross-validation to estimate the performance of the models.
Code
set.seed(1353)
folds <- vfold_cv(data, strata=Personal.Loan)
formula <- Personal.Loan ~ Age + Experience + Income + Family + CCAvg + Education +
Mortgage + Securities.Account + CD.Account + Online +
CreditCard
# Train the logistic regression model
logreg_wf <- workflow() %>%
add_model(logistic_reg() %>% set_engine("glm")) %>%
add_formula(formula)
logreg_cv <- logreg_wf %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
logreg_cv_predictions <- collect_predictions(logreg_cv)
# Train the nearest-neighbor model with 5 neighbors
nn5_wf <- workflow() %>%
add_model(nearest_neighbor(neighbors=5) %>%
set_mode("classification") %>%
set_engine("kknn")) %>%
add_formula(formula)
nn5_cv <- nn5_wf %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn5_cv_predictions <- collect_predictions(nn5_cv)
Figure 15.10 shows the ROC curves for the logistic regression and the 5-NN model. The logistic regression model performs better than the 5-NN model. The ROC curve of the 5-NN model has a strange shape, likely because we did not tune the model.
Let's tune the model with the default settings.
Code
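# Tune the number of neighbors of the nearest-neighbor model
nn_wf <- workflow() %>%
    add_model(nearest_neighbor(neighbors=tune()) %>%
        set_mode("classification") %>%
        set_engine("kknn")) %>%
    add_formula(formula)
nn_default_tune <- tune_grid(nn_wf, resamples=folds)
autoplot(nn_default_tune)
best_nn_roc <- nn_default_tune %>%
    select_best(metric="roc_auc")
best_nn_accuracy <- nn_default_tune %>%
    select_best(metric="accuracy")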
Figure 15.8 shows the results of the tuning process. There are two interesting observations to point out. First, the selected number of neighbors depends on the metric. If we use accuracy, the optimal number of neighbors is 5, the same value we used in our initial model. For ROC AUC, the optimal number of neighbors is 14.
The second observation is that the roc_auc curve has not reached a maximum; it is still rising at the upper end of the default range. This means that we have not explored the entire parameter space. We can increase the number of neighbors to explore the parameter space further.
Code
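parameters <- extract_parameter_set_dials(nn_wf) %>%
    update(neighbors = neighbors(c(1, 100)))
nn_bayes_tune <- tune_bayes(nn_wf, resamples=folds, param_info=parameters, iter=25)
autoplot(nn_bayes_tune)
optimal_nn_roc <- nn_bayes_tune %>% select_best(metric="roc_auc")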
## ! No improvement for 10 iterations; returning current results.
This time, the hyperparameter search identified the optimal number of neighbors as 48.
Code
cv_ROC <- logreg_cv_predictions %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn5_ROC <- nn5_cv_predictions %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn_default_auc <- nn_wf %>%
finalize_workflow(select_best(nn_default_tune, metric="roc_auc")) %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_default_auc_ROC <- nn_default_auc %>% collect_predictions() %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn_bayes_auc <- nn_wf %>%
finalize_workflow(select_best(nn_bayes_tune, metric="roc_auc")) %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_bayes_auc_ROC <- nn_bayes_auc %>% collect_predictions() %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
g1 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn5_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(a) Initial model (accuracy, k=5)")
g2 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn_default_auc_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(b) Default model (AUC, k=14)")
g3 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn_bayes_auc_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(c) Optimal model (AUC, k=48)")
g1 + g2 + g3
Comparing the three ROC curves, we see that the fully tuned model, with sufficient exploration of the hyperparameter space, performs better than the initial model across the full data range.
Let's summarize the performance metrics:
Code
logreg_metrics <- collect_metrics(logreg_cv)
nn5_metrics <- collect_metrics(nn5_cv)
nn_default_metrics <- collect_metrics(nn_default_auc)
nn_bayes_metrics <- collect_metrics(nn_bayes_auc)
df <- bind_rows(logreg_metrics %>% mutate(model='Logistic regression'),
nn5_metrics %>% mutate(model='Nearest neighbor (k=5)'),
nn_default_metrics %>% mutate(model='Nearest neighbor (k=14)'),
nn_bayes_metrics %>% mutate(model='Nearest neighbor (k=48)'),
)
df %>%
select(model, mean, .metric) %>%
pivot_wider(names_from=.metric, values_from=mean) %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(full_width=FALSE)
model | accuracy | brier_class | roc_auc |
---|---|---|---|
Logistic regression | 0.958 | 0.032 | 0.961 |
Nearest neighbor (k=5) | 0.965 | 0.028 | 0.938 |
Nearest neighbor (k=14) | 0.960 | 0.029 | 0.966 |
Nearest neighbor (k=48) | 0.947 | 0.037 | 0.978 |
Code
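df %>%
    mutate(
        model = factor(model,
            levels=rev(c("Logistic regression", "Nearest neighbor (k=5)",
                "Nearest neighbor (k=14)", "Nearest neighbor (k=48)")))
    ) %>%
    ggplot(aes(x=mean, y=model)) +
    geom_point() +
    geom_pointrange(aes(xmin=mean-std_err, xmax=mean+std_err)) +
    facet_wrap(~ .metric)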
Tuning the hyperparameters of the \(k\)-NN model improved the ROC AUC beyond that of the logistic regression model. The accuracy, on the other hand, dropped. However, as we learned in Chapter 11, accuracy is a performance metric that depends on the threshold used for classification. Using the methods described in Section 11.1.2, we can tweak the thresholds of the logistic regression and the best \(k\)-NN model. In the following example, we use model %>% collect_predictions() to get the out-of-fold predictions from cross-validation and pass them to the probably::threshold_perf() function.
Code
threshold_graph <- function(model) {
performance <- probably::threshold_perf(
model %>% collect_predictions(),
Personal.Loan, .pred_Yes,
thresholds=seq(0.05, 0.9, 0.01), event_level="first",
metrics=metric_set(accuracy, kap, bal_accuracy, f_meas)
)
max_values <- performance %>%
arrange(desc(.threshold)) %>%
group_by(.metric) %>%
filter(.estimate == max(.estimate)) %>%
filter(row_number()==1)
g <- ggplot(performance, aes(x=.threshold, y=.estimate, color=.metric)) +
geom_line() +
geom_vline(data=max_values, aes(xintercept=.threshold, color=.metric)) +
scale_x_continuous(breaks=seq(0, 1, 0.1)) +
coord_cartesian(ylim=c(0.5, 1)) +
xlab('Threshold') + ylab('Metric value') +
theme(legend.position="inside", legend.position.inside = c(0.85, 0.75))
return (g)
}
g1 <- threshold_graph(logreg_cv) + labs(title="Logistic regression model")
g2 <- threshold_graph(nn_bayes_auc) + labs(title="k-nearest neighbor model")
g1 + g2
Figure 15.12 shows the performance of the logistic regression and the \(k\)-NN model for different thresholds. For all metrics, the threshold-dependent maxima are higher for the \(k\)-NN model than for logistic regression.
15.5 The one-standard-error rule
Breiman and coworkers (Breiman et al. 1984) suggested that instead of selecting the best model, we should select the simplest model within one standard error of the best model. This is called the one-standard-error rule. The idea is that the best model is likely to be overfitted and that a simpler model is likely to generalize better.
Let's demonstrate this with a ridge regression model to predict mpg in the mtcars dataset. First, tune the model over a suitable range for the penalty parameter.
Code
set.seed(123)
resamples <- vfold_cv(mtcars)
rec <- recipe(mpg ~ ., data=mtcars)
spec <- linear_reg(mode="regression", penalty=tune(), mixture=0) %>%
set_engine("glmnet")
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(spec)
parameters <- extract_parameter_set_dials(wf) %>%
update(penalty=penalty(c(-2, 2)))
tune_results <- tune_grid(wf,
resamples=resamples,
grid=grid_regular(parameters, levels=50))
The one-standard-error rule is implemented in the select_by_one_std_err() function. It is used in a similar way to the select_best() function.
Code
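penalty_best <- select_best(tune_results, metric="rmse")
penalty_1se <- select_by_one_std_err(tune_results, desc(penalty), metric="rmse")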
The implementation doesn't know what a simpler model is, so we need to tell it. In this case, we use the penalty parameter as a proxy for model complexity: by sorting with desc(penalty), we inform the function that complexity decreases as the penalty increases.
Figure 15.13 shows the results. The best model is the one with the lowest RMSE. The one-standard-error rule adds one standard error to the minimum and then moves horizontally to the right until it crosses the curve. The penalty value to the left of the crossing is the selected parameter.
Code
penalty_best <- show_best(tune_results, metric="rmse", n=1)
tune_results %>% collect_metrics() %>%
filter(.metric=="rmse") %>%
ggplot(aes(x=penalty, y=mean, ymax=mean+std_err, ymin=mean-std_err)) +
geom_line() +
geom_errorbar(color="grey") +
geom_point() +
scale_x_log10() +
coord_cartesian(ylim=c(2, 4)) +
geom_vline(xintercept=penalty_best$penalty[1], color="blue") +
geom_vline(xintercept=penalty_1se$penalty[1], color="red") +
geom_hline(yintercept=penalty_best$mean[1] + penalty_best$std_err[1], color="blue")
Useful to know:
In this section, we learned that it is important to:
- sufficiently explore the hyperparameter space
- use the right metric for model selection (e.g., ROC AUC)
- select the threshold after you have identified the best model
Further information:
- dials package: https://dials.tidymodels.org/
- colino package: https://stevenpawley.github.io/colino/
The code of this chapter is summarized here.
Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_tune.png")
library(tidymodels)
library(tidyverse)
library(patchwork)
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)
# tidymodels is picky about data types: age is stored as an integer in
# ISLR2::Wage, and predict() will complain if new data contains age as
# a double. We therefore convert age to double here.
data <- ISLR2::Wage %>%
mutate(age = as.double(age))
recipe(wage ~ age, data=data) %>%
step_poly(age) %>%
step_discretize(age) %>%
step_cut(age, breaks=tune()) %>%
step_bs(age) %>%
tunable()
set.seed(123)
poly_recipe <- recipe(wage ~ age, data=data) %>%
step_poly(age, degree=tune())
model <- linear_reg(mode="regression") %>%
set_engine("glm")
poly_workflow <- workflow() %>%
add_recipe(poly_recipe) %>%
add_model(model)
tune_results <- tune_grid(poly_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- poly_workflow %>%
finalize_workflow(best_parameters) %>%
fit(data)
df <- tibble(age = seq(min(data$age), max(data$age), length.out = 100))
df %>%
bind_cols(
predict(final_model, new_data=df),
predict(final_model, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
set.seed(123)
step_recipe <- recipe(wage ~ age, data=data) %>%
step_discretize(age, num_breaks=tune())
step_workflow <- workflow() %>%
add_recipe(step_recipe) %>%
add_model(model)
tune_results <- tune_grid(step_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- step_workflow %>%
finalize_workflow(best_parameters) %>%
fit(data)
best_breaks <- (tune_results %>% show_best(metric="rmse"))$num_breaks[1]
cuts <- tibble(breaks=quantile(data$age, probs=seq(0, 1, by=1/best_breaks)))
df %>%
bind_cols(
predict(final_model, new_data=df),
predict(final_model, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_vline(aes(xintercept=breaks), data=cuts, color='darkgreen', alpha=0.5) +
geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
step_recipe <- recipe(wage ~ age, data=data) %>%
step_cut(age, breaks=seq(20, 70, by=10), include_outside_range=TRUE)
step_workflow <- workflow() %>%
add_recipe(step_recipe) %>%
add_model(model)
trained <- step_workflow %>% fit(data)
cuts <- tibble(breaks=seq(20, 70, by=10))
df %>%
bind_cols(
predict(trained, new_data=df),
predict(trained, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_vline(aes(xintercept=breaks), data=cuts, color='darkgreen', alpha=0.5) +
geom_point(aes(x=age, y=wage),data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
set.seed(123)
spline_recipe <- recipe(wage ~ age, data=data) %>%
step_bs(age, degree=tune(), deg_free=tune())
spline_workflow <- workflow() %>%
add_recipe(spline_recipe) %>%
add_model(model)
tune_results <- tune_grid(spline_workflow, resamples=vfold_cv(data))
tune_results %>% show_best(metric="rmse")
best_parameters <- select_best(tune_results, metric="rmse")
final_model <- spline_workflow %>%
finalize_workflow(best_parameters) %>%
fit(data)
df %>%
bind_cols(
predict(final_model, new_data=df),
predict(final_model, new_data=df, type="conf_int")
) %>%
ggplot(aes(x=age, y=.pred)) +
geom_point(aes(x=age, y=wage), data=data, alpha=0.1) +
geom_line() +
geom_ribbon(aes(ymin=.pred_lower, ymax=.pred_upper), alpha=0.2) +
labs(x="Age", y="Wage")
knitr::include_graphics("images/regularization.png")
if (!require(colino)) {
devtools::install_github("stevenpawley/colino", force=TRUE)
}
library(colino)
set.seed(123)
resamples <- vfold_cv(mtcars)
rec_select <- recipe(mpg ~ ., data=mtcars) %>%
step_normalize(all_numeric_predictors()) %>%
step_select_forests(all_predictors(), outcome="mpg", top_p=tune())
model <- nearest_neighbor(mode="regression", neighbors=tune())
mtcars_workflow <- workflow() %>%
add_recipe(rec_select) %>%
add_model(model)
parameters <- extract_parameter_set_dials(mtcars_workflow)
parameters %>% knitr::kable()
tune_results <- tune_grid(mtcars_workflow,
resamples=resamples,
grid=grid_random(parameters, size=50))
tune_results %>%
show_best(metric='rmse') %>%
select(-.config) %>%
knitr::kable()
nfeatures <- (tune_results %>% show_best(metric='rmse'))$top_p[1]
nneighbors <- (tune_results %>% show_best(metric='rmse'))$neighbors[1]
best_parameters <- select_best(tune_results, metric="rmse")
best_workflow <- mtcars_workflow %>%
finalize_workflow(best_parameters) %>%
fit(mtcars)
feature_scores <- best_workflow %>%
extract_recipe() %>%
tidy(number = 2, type = "scores")
kept_features <- feature_scores$variable[1:nfeatures]
mtcars %>%
bind_cols(
predict(best_workflow, new_data=mtcars)
) %>%
ggplot(aes(x=mpg, y=.pred)) +
geom_point() +
geom_abline() +
xlim(10, 35) + ylim(10, 35) +
labs(x="Observed mpg", y="Predicted mpg")
best_workflow %>% extract_recipe() %>%
tidy(number = 2, type = "scores")
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/UniversalBank.csv.gz")
data <- data %>%
select(-c(ID, `ZIP Code`)) %>%
rename(
Personal.Loan = `Personal Loan`,
Securities.Account = `Securities Account`,
CD.Account = `CD Account`
) %>%
mutate(
Personal.Loan = factor(Personal.Loan, labels=c("Yes", "No"), levels=c(1, 0)),
Education = factor(Education, labels=c("Undergrad", "Graduate", "Advanced")),
)
set.seed(1353)
folds <- vfold_cv(data, strata=Personal.Loan)
formula <- Personal.Loan ~ Age + Experience + Income + Family + CCAvg + Education +
Mortgage + Securities.Account + CD.Account + Online +
CreditCard
# Train the logistic regression model
logreg_wf <- workflow() %>%
add_model(logistic_reg() %>% set_engine("glm")) %>%
add_formula(formula)
logreg_cv <- logreg_wf %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
logreg_cv_predictions <- collect_predictions(logreg_cv)
# Train the nearest-neighbor model with 5 neighbors
nn5_wf <- workflow() %>%
add_model(nearest_neighbor(neighbors=5) %>%
set_mode("classification") %>%
set_engine("kknn")) %>%
add_formula(formula)
nn5_cv <- nn5_wf %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn5_cv_predictions <- collect_predictions(nn5_cv)
# Tune the number of neighbors of the nearest-neighbor model
nn_wf <- workflow() %>%
add_model(nearest_neighbor(neighbors=tune()) %>%
set_mode("classification") %>%
set_engine("kknn")) %>%
add_formula(formula)
nn_default_tune <- tune_grid(nn_wf, resamples=folds)
autoplot(nn_default_tune)
best_nn_roc <- nn_default_tune %>%
select_best(metric="roc_auc")
best_nn_accuracy <- nn_default_tune %>%
select_best(metric="accuracy")
parameters <- extract_parameter_set_dials(nn_wf) %>%
update(neighbors = neighbors(c(1, 100)))
nn_bayes_tune <- tune_bayes(nn_wf, resamples=folds, param_info=parameters, iter=25)
autoplot(nn_bayes_tune)
optimal_nn_roc <- nn_bayes_tune %>% select_best(metric="roc_auc")
cv_ROC <- logreg_cv_predictions %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn5_ROC <- nn5_cv_predictions %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn_default_auc <- nn_wf %>%
finalize_workflow(select_best(nn_default_tune, metric="roc_auc")) %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_default_auc_ROC <- nn_default_auc %>% collect_predictions() %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
nn_bayes_auc <- nn_wf %>%
finalize_workflow(select_best(nn_bayes_tune, metric="roc_auc")) %>%
fit_resamples(resamples=folds, control=control_resamples(save_pred=TRUE))
nn_bayes_auc_ROC <- nn_bayes_auc %>% collect_predictions() %>%
roc_curve(truth=Personal.Loan, .pred_Yes, event_level="first")
g1 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn5_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(a) Initial model (accuracy, k=5)")
g2 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn_default_auc_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(b) Default model (AUC, k=14)")
g3 <- ggplot() +
geom_path(data=cv_ROC, aes(x=1-specificity, y=sensitivity), color="gray") +
geom_path(data=nn_bayes_auc_ROC, aes(x=1-specificity, y=sensitivity)) +
geom_abline(lty=2) +
labs(title="(c) Optimal model (AUC, k=48)")
g1 + g2 + g3
logreg_metrics <- collect_metrics(logreg_cv)
nn5_metrics <- collect_metrics(nn5_cv)
nn_default_metrics <- collect_metrics(nn_default_auc)
nn_bayes_metrics <- collect_metrics(nn_bayes_auc)
df <- bind_rows(logreg_metrics %>% mutate(model='Logistic regression'),
nn5_metrics %>% mutate(model='Nearest neighbor (k=5)'),
nn_default_metrics %>% mutate(model='Nearest neighbor (k=14)'),
nn_bayes_metrics %>% mutate(model='Nearest neighbor (k=48)'),
)
df %>%
select(model, mean, .metric) %>%
pivot_wider(names_from=.metric, values_from=mean) %>%
knitr::kable(digits = 3) %>%
kableExtra::kable_styling(full_width=FALSE)
df %>%
mutate(
model = factor(model,
levels=rev(c("Logistic regression", "Nearest neighbor (k=5)",
"Nearest neighbor (k=14)", "Nearest neighbor (k=48)")))
) %>%
ggplot(aes(x=mean, y=model)) +
geom_point() +
geom_pointrange(aes(xmin=mean-std_err, xmax=mean+std_err)) +
facet_wrap(~ .metric)
threshold_graph <- function(model) {
performance <- probably::threshold_perf(
model %>% collect_predictions(),
Personal.Loan, .pred_Yes,
thresholds=seq(0.05, 0.9, 0.01), event_level="first",
metrics=metric_set(accuracy, kap, bal_accuracy, f_meas)
)
max_values <- performance %>%
arrange(desc(.threshold)) %>%
group_by(.metric) %>%
filter(.estimate == max(.estimate)) %>%
filter(row_number()==1)
g <- ggplot(performance, aes(x=.threshold, y=.estimate, color=.metric)) +
geom_line() +
geom_vline(data=max_values, aes(xintercept=.threshold, color=.metric)) +
scale_x_continuous(breaks=seq(0, 1, 0.1)) +
coord_cartesian(ylim=c(0.5, 1)) +
xlab('Threshold') + ylab('Metric value') +
theme(legend.position="inside", legend.position.inside = c(0.85, 0.75))
return (g)
}
g1 <- threshold_graph(logreg_cv) + labs(title="Logistic regression model")
g2 <- threshold_graph(nn_bayes_auc) + labs(title="k-nearest neighbor model")
g1 + g2
set.seed(123)
resamples <- vfold_cv(mtcars)
rec <- recipe(mpg ~ ., data=mtcars)
spec <- linear_reg(mode="regression", penalty=tune(), mixture=0) %>%
set_engine("glmnet")
wf <- workflow() %>%
add_recipe(rec) %>%
add_model(spec)
parameters <- extract_parameter_set_dials(wf) %>%
update(penalty=penalty(c(-2, 2)))
tune_results <- tune_grid(wf,
resamples=resamples,
grid=grid_regular(parameters, levels=50))
penalty_best <- select_best(tune_results, metric="rmse")
penalty_1se <- select_by_one_std_err(tune_results, desc(penalty), metric="rmse")
penalty_best <- show_best(tune_results, metric="rmse", n=1)
tune_results %>% collect_metrics() %>%
filter(.metric=="rmse") %>%
ggplot(aes(x=penalty, y=mean, ymax=mean+std_err, ymin=mean-std_err)) +
geom_line() +
geom_errorbar(color="grey") +
geom_point() +
scale_x_log10() +
coord_cartesian(ylim=c(2, 4)) +
geom_vline(xintercept=penalty_best$penalty[1], color="blue") +
geom_vline(xintercept=penalty_1se$penalty[1], color="red") +
geom_hline(yintercept=penalty_best$mean[1] + penalty_best$std_err[1], color="blue")
stopCluster(cl)
registerDoSEQ()
References
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Wadsworth.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer.
11. You can find different definitions of elasticnet regularization in the literature. Here, we use the definition from the glmnet package.
12. I initially ran this code using size=10 for the random grid search. There was a large gap between the performance of the best model and the other models. This is an indication that the search was not exhaustive enough.