Chapter 6 Workflows: Connecting the parts
Training a model is a multi-step process. It requires:
- defining the model
- training and validating the model
- deploying the model
Figure 6.1 summarizes the modeling workflow and shows how the individual steps are implemented in the tidymodels
framework.
The left column covers the model definition part. A complete model definition requires:
- Preprocessing (recipes package - see Chapter 7)
- Model specification (parsnip package - see Chapters 8 and 10)
- Postprocessing (probably package - see Section 11.1.2)
The workflows package from tidymodels allows us to combine the first two steps into a single workflow object. The workflow object can then be used to train and validate the model. At the moment, only the preprocessing and model specification can be included in the workflow. While the postprocessing step should be part of the full model process, the workflows package doesn't support it yet, so postprocessing has to be done separately. For example, we will see in Section 11.1.2 how the probably package can be used to define a threshold for binary classification models. In this class, we will only use postprocessing for classification models.
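As a preview, here is a minimal, hypothetical sketch of such a thresholding step with the probably package; the probabilities and the 0.7 threshold are made up, and Section 11.1.2 covers the details:

library(probably)
library(dplyr)

# hypothetical class probabilities for a two-class outcome with levels c("No", "Yes")
predictions <- tibble(.pred_No=c(0.80, 0.35, 0.55), .pred_Yes=c(0.20, 0.65, 0.45))

# make_two_class_pred() assigns the first level ("No") whenever its probability
# reaches the threshold; requiring P(No) >= 0.7 means we predict "Yes" unless
# we are fairly confident of "No"
predictions %>%
    mutate(.pred_class=make_two_class_pred(.pred_No, c("No", "Yes"), threshold=0.7))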
The workflows package can also orchestrate model tuning and validation. This involves:
- Model tuning (tune package - see Chapter 14)
- Model validation (rsample and yardstick packages - see Chapters 12 and 13)
- Tuning the postprocessing (probably package - see Chapter 14)
The objective of model tuning is to find the best model parameters. These can include the model hyperparameters (e.g. the number of trees in a random forest) and the preprocessing parameters (e.g. the number of principal components in a PCA). The tune package allows us to define potential values and combinations of these parameters. Combined with the validation strategy defined using the rsample package, this allows tune to examine the performance of different models and select the "best" one. The performance is measured using the various metrics provided by the yardstick package.
At the end of the model training step, we end up with a final trained workflow for deployment. For now, this means predicting new data using the final model (see the sketch after this list) by:
- preprocessing the new data using the (tuned) preprocessing steps
- predicting with the (tuned) model
- if applicable, postprocessing the predictions (e.g. applying a threshold to the predicted class probabilities)
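With a fitted workflow, the first two of these steps happen in a single call; a minimal sketch, assuming a fitted workflow object and new data (both names hypothetical):

# predict() on a fitted workflow first applies the trained preprocessing
# steps to new_data and then predicts with the trained model; any
# postprocessing (e.g. a custom threshold) has to be applied separately
predictions <- predict(fitted_workflow, new_data=new_data)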
6.1 Example
In the following chapters, we will discuss the individual parts and see how they fit together. Here, we use the Loan prediction dataset to illustrate the whole process.
Load the required packages:
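A minimal sketch of this step (the exact set of packages is an assumption; tidymodels and tidyverse cover everything used below):

# tidymodels loads recipes, parsnip, workflows, tune, rsample, and yardstick;
# tidyverse provides read_csv() and the data manipulation verbs used below
library(tidymodels)
library(tidyverse)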
Load and preprocess the data:
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/loan_prediction.csv",
show_col_types=FALSE) %>%
drop_na() %>%
mutate(
Gender=as.factor(Gender),
Married=as.factor(Married),
# "3+" becomes 3 so that Dependents can be treated as numeric
Dependents=gsub("\\+", "", Dependents) %>% as.numeric(),
Education=as.factor(Education),
Self_Employed=as.factor(Self_Employed),
Credit_History=as.factor(Credit_History),
Property_Area=as.factor(Property_Area),
Loan_Status=factor(Loan_Status, levels=c("N", "Y"), labels=c("No", "Yes"))
) %>%
select(-Loan_ID)
Split dataset into training and holdout data, prepare for cross-validation:
set.seed(123)
data_split <- initial_split(data, prop=0.8, strata=Loan_Status)
train_data <- training(data_split)
holdout_data <- testing(data_split)
resamples <- vfold_cv(train_data, v=10, strata=Loan_Status)
cv_metrics <- metric_set(roc_auc, accuracy)
cv_control <- control_resamples(save_pred=TRUE)
Define the recipe, the model specification (elastic net logistic regression), and combine them into a workflow:
formula <- Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed +
ApplicantIncome + CoapplicantIncome + LoanAmount + Loan_Amount_Term +
Credit_History + Property_Area
recipe_spec <- recipe(formula, data=train_data) %>%
step_dummy(all_nominal(), -all_outcomes())
model_spec <- logistic_reg(engine="glmnet", mode="classification",
penalty=tune(), mixture=tune())
wf <- workflow() %>%
add_model(model_spec) %>%
add_recipe(recipe_spec)
Tune the penalty and mixture hyperparameters using Bayesian hyperparameter optimization:
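A sketch of this tuning step; the iteration budget is an assumption, and control_bayes(no_improve=10) is chosen to match the message below:

# Bayesian search over the penalty and mixture values; stop early after
# 10 iterations without improvement
tune_results <- tune_bayes(wf, resamples=resamples, metrics=cv_metrics,
                           iter=25, control=control_bayes(no_improve=10))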
## ! No improvement for 10 iterations; returning current results.
The autoplot of the tune_bayes object (Figure 6.2) shows the ROC-AUC for different values of the penalty and mixture hyperparameters. We can see that the best roc_auc is obtained with penalty and mixture values inside the tuning range, so we don't need to adjust the sampling ranges for the hyperparameters.
Finalize the workflow:
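A sketch of the finalization step, assuming the tuning results from above are stored in tune_results:

# pick the hyperparameter combination with the best cross-validated ROC-AUC
# and fix these values in the workflow
best_parameter <- select_best(tune_results, metric="roc_auc")
tuned_wf <- finalize_workflow(wf, best_parameter)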
The best roc_auc is obtained with a penalty of 0.0216104 and a mixture of 0.0517287.
Use the tuned workflow for cross-validation and for training the final model on the full training data:
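A sketch of this step, assuming the finalized workflow tuned_wf from above:

# cross-validate the finalized workflow ...
result_cv <- fit_resamples(tuned_wf, resamples=resamples, metrics=cv_metrics,
                           control=cv_control)
# ... and fit it on the full training data for use on new data
fitted_model <- fit(tuned_wf, train_data)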
Estimate model performance using the cross-validation results and the holdout data:
cv_results <- collect_metrics(result_cv) %>%
    select(.metric, mean) %>%
    rename(.estimate=mean) %>%
    mutate(result="Cross-validation")
holdout_predictions <- augment(fitted_model, new_data=holdout_data)
holdout_results <- bind_rows(
    roc_auc(holdout_predictions, Loan_Status, .pred_Yes, event_level="second"),
    accuracy(holdout_predictions, Loan_Status, .pred_class)
) %>%
    select(-.estimator) %>%
    mutate(result="Holdout")
The performance metrics are summarized in the following table.
| result | accuracy | roc_auc |
|---|---|---|
| Cross-validation | 0.802 | 0.754 |
| Holdout | 0.835 | 0.730 |
6.2 Models vs. workflows
It may initially be confusing to have a second way of building models. However, the two approaches are consistent: as the following table shows, they differ only in how the model and the formula are specified.

| | Model | Workflow |
|---|---|---|
| Specification | `model <- linear_reg()` | `rec_definition <- recipe(formula, data=train_data)`<br>`wf <- workflow() %>% add_model(linear_reg()) %>% add_recipe(rec_definition)` |
| Validation | `result_cv <- model %>% fit_resamples(formula, resamples)` | `result_cv <- wf %>% fit_resamples(resamples)` |
| Model fit | `fitted_model <- model %>% fit(formula, trainData)` | `fitted_model <- wf %>% fit(trainData)` |
| Prediction | `pred <- fitted_model %>% predict(new_data=newdata)` | `pred <- fitted_model %>% predict(new_data=newdata)` |
| Augmenting a dataset | `aug_data <- fitted_model %>% augment(new_data=newdata)` | `aug_data <- fitted_model %>% augment(new_data=newdata)` |
As we will see in Chapters 7 and 14, workflows are required to incorporate preprocessing into the model building process and to tune model parameters. It is therefore best to use workflows, and to fall back on plain models only when absolutely necessary.
Further information:
- Take the short DataCamp course at https://app.datacamp.com/learn/courses/modeling-with-tidymodels-in-r
- Go to https://workflows.tidymodels.org/ to learn more about the workflows package
- The workflowsets package allows you to combine multiple workflows into a single object. This is useful when you want to compare multiple preprocessing steps and/or multiple models at the same time. We will not cover this package in this class.