Chapter 25 Models

As we’ve seen in Chapters 8 and 10, defining a model requires first choosing the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).

A number of packages provide model engines for classification, regression, and censored regression that are compatible with parsnip.

Use the command show_engines(...) to get an overview of all available engines for a given model type.

Code
library(parsnip)
show_engines("linear_reg")
## # A tibble: 7 × 2
##   engine mode      
##   <chr>  <chr>     
## 1 lm     regression
## 2 glm    regression
## 3 glmnet regression
## 4 stan   regression
## 5 spark  regression
## 6 keras  regression
## 7 brulee regression

Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest neighbor model. These parameters can be set when defining the model; parsnip translates them into the engine-specific parameters. For example, the following code defines a k-nearest neighbor model with 5 neighbors and a triangular distance weighting function:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="triangular") %>%
    set_engine("kknn") %>%
    translate()
## K-Nearest Neighbor Model Specification (regression)
## 
## Main Arguments:
##   neighbors = 5
##   weight_func = triangular
## 
## Computational engine: kknn 
## 
## Model fit template:
## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
##     ks = min_rows(5, data, 5), kernel = "triangular")

The translate() function shows how the underlying engine is actually called. In this case, the neighbors parameter is mapped to ks=min_rows(5, data, 5) and weight_func is mapped to kernel="triangular". This mapping is useful when you want to learn more about an engine and read the documentation of the engine package.

In the following, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found at https://www.tidymodels.org/find/parsnip/.

25.1 Non-informative model null_model (regression and classification)

While not a useful predictive model on its own, a non-informative model provides a good baseline against which to compare other models. A non-informative model always predicts the mean of the response variable for regression and the most frequent class for classification. See https://parsnip.tidymodels.org/reference/null_model.html for details.

Code
null_model(mode="regression") %>% 
    set_engine("parsnip")
## Null Model Specification (regression)
## 
## Computational engine: parsnip
Code
null_model(mode="classification") %>%
    set_engine("parsnip")
## Null Model Specification (classification)
## 
## Computational engine: parsnip
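
As a quick check of this baseline behavior, the following sketch (using the built-in mtcars data as an assumed example) fits a regression null model; every prediction is simply the training mean of mpg:

Code
null_model(mode="regression") %>%
    set_engine("parsnip") %>%
    fit(mpg ~ ., data=mtcars) %>%     # the predictors in the formula are ignored
    predict(new_data=head(mtcars))    # returns mean(mtcars$mpg) for every row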

25.2 Linear regression models linear_reg (regression)

See https://parsnip.tidymodels.org/reference/linear_reg.html for details.

25.2.1 lm engine (default)

Code
linear_reg(mode="regression") %>%
    set_engine("lm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

No tunable parameters.

25.2.2 glm engine (generalized linear model)

The glm engine is a more flexible version of the lm engine. It lets you specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data, etc.) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data, etc.).

Code
linear_reg(mode="regression") %>%
    set_engine("glm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: glm

No tunable parameters.
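
The distribution and link are passed as engine arguments. As a sketch, a Poisson regression with a log link (assuming a count-valued response) could be specified like this:

Code
linear_reg(mode="regression") %>%
    set_engine("glm", family=stats::poisson(link="log"))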

Useful to know:

When using the glm or glmnet engine, you might come across this warning:

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

In general, you can ignore it. It means that the model is very certain about the predicted class of some observations.

25.2.3 glmnet engine (regularized linear regression)

Code
linear_reg(mode="regression") %>%
    set_engine("glmnet")
## Linear Regression Model Specification (regression)
## 
## Computational engine: glmnet

glmnet supports L1 and L2 regularization. Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:

Code
linear_reg(mode="regression", penalty=0.01, mixture=0.5) %>%
    set_engine("glmnet")
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.01
##   mixture = 0.5
## 
## Computational engine: glmnet
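
The mixture argument controls the blend of the two penalties: mixture=1 gives a pure L1 (lasso) model and mixture=0 a pure L2 (ridge) model. For example:

Code
# pure lasso (L1 only)
linear_reg(mode="regression", penalty=0.01, mixture=1) %>%
    set_engine("glmnet")

# pure ridge (L2 only)
linear_reg(mode="regression", penalty=0.01, mixture=0) %>%
    set_engine("glmnet")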

25.3 Partial least squares regression pls (regression)

See https://parsnip.tidymodels.org/reference/pls.html for details.

25.3.1 mixOmics engine (default)

This engine requires installation of the mixOmics package. See http://mixomics.org/ for details. Use the following to install the package:

Code
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "http://cran.us.r-project.org")
if (!require("plsmod", quietly = TRUE))
    install.packages("plsmod", repos = "http://cran.us.r-project.org")

BiocManager::install("mixOmics")
Code
library(plsmod)

pls(mode="regression") %>%
    set_engine("mixOmics") 
## PLS Model Specification (regression)
## 
## Computational engine: mixOmics

The engine has two tunable parameters: num_comp (the number of PLS components) and predictor_prop (the maximum proportion of predictors allowed to have non-zero loadings on each component).
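
A sketch setting both (the specific values are illustrative):

Code
pls(mode="regression", num_comp=3, predictor_prop=0.75) %>%
    set_engine("mixOmics")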

25.4 Logistic regression models logistic_reg (classification)

See https://parsnip.tidymodels.org/reference/logistic_reg.html for details.

25.4.1 glm engine (default)

Code
logistic_reg(mode="classification") %>%
    set_engine("glm")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

See comments above for glm engine in linear regression models (Section 25.2.2).

25.4.2 glmnet engine (regularized logistic regression)

Code
logistic_reg(mode="classification") %>%
    set_engine("glmnet")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glmnet

See comments above for the glmnet engine in linear regression models (Section 25.2.3).

25.5 Nearest neighbor models nearest_neighbor (classification and regression)

Nearest neighbor models can be used for classification and regression. It is therefore necessary to specify the mode of the model (i.e. mode="regression" or mode="classification"). See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.

25.5.1 kknn engine (default)

kknn is currently the only supported engine. It supports both classification and regression. See https://parsnip.tidymodels.org/reference/kknn.html for details.

Use mode to specify either a classification or regression model:

Code
nearest_neighbor(mode="classification") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (classification)
## 
## Computational engine: kknn
Code
nearest_neighbor(mode="regression") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (regression)
## 
## Computational engine: kknn

The engine has several tunable parameters. Here is an example of a k-nearest neighbor model with 5 neighbors and a triangular distance weighting function:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="triangular") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (regression)
## 
## Main Arguments:
##   neighbors = 5
##   weight_func = triangular
## 
## Computational engine: kknn

A triangular weight function applies more weight to neighbors that are closer to the observation. There are other options; for example, rectangular weights all neighbors equally, as shown below.
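
A standard (unweighted) k-nearest neighbor model therefore uses the rectangular kernel:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="rectangular") %>%
    set_engine("kknn")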

See https://rdrr.io/cran/kknn/man/train.kknn.html for details about the kknn package.

25.6 Linear discriminant analysis discrim_linear (classification)

See https://parsnip.tidymodels.org/reference/discrim_linear.html for details.

25.6.1 MASS engine (default)

You will need to load the discrim package to use this engine.

Code
library(discrim)

discrim_linear(mode="classification") %>%
    set_engine("MASS")
## Linear Discriminant Model Specification (classification)
## 
## Computational engine: MASS

No tunable parameters.
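
A minimal usage sketch, fitting to the built-in iris data (an assumed example dataset):

Code
discrim_linear(mode="classification") %>%
    set_engine("MASS") %>%
    fit(Species ~ ., data=iris) %>%
    predict(new_data=head(iris))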

25.7 Quadratic discriminant analysis discrim_quad (classification)

See https://parsnip.tidymodels.org/reference/discrim_quad.html for details.

25.7.1 MASS engine (default)

You will need to load the discrim package to use this engine.

Code
library(discrim)

discrim_quad(mode="classification") %>%
    set_engine("MASS")
## Quadratic Discriminant Model Specification (classification)
## 
## Computational engine: MASS

No tunable parameters.

25.8 Generalized additive models gen_additive_mod (regression and classification)

See https://parsnip.tidymodels.org/reference/gen_additive_mod.html for details.

25.8.1 mgcv engine (default)

You will need to have the mgcv package installed to use this engine.

Code
gen_additive_mod(mode="regression") %>%
    set_engine("mgcv")
## GAM Model Specification (regression)
## 
## Computational engine: mgcv

The model has two tuning parameters (see the sketch after this list):

  • select_features (default FALSE): if TRUE, the model will add a penalty term so that terms can be penalized to zero.
  • adjust_deg_free (default 1): level of penalization; higher values lead to more penalization.
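
A sketch with both parameters set (the values are illustrative). Note that with the mgcv engine, smooth terms are specified in the model formula at fit time using s(); mtcars is used here as an assumed example:

Code
gen_additive_mod(mode="regression", select_features=TRUE, adjust_deg_free=1.5) %>%
    set_engine("mgcv") %>%
    fit(mpg ~ s(disp) + wt, data=mtcars)   # s() marks the smooth term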

See the deep-dive chapter on gen_additive_mod for more details.

25.9 Decision tree models decision_tree (classification, regression, and censored regression)

See https://parsnip.tidymodels.org/reference/decision_tree.html for details.

25.9.1 rpart engine (default)

Code
decision_tree(mode="regression") %>%
    set_engine("rpart")
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

The model has three tuning parameters (see the sketch after this list):

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0.01): complexity parameter; higher values lead to simpler trees
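
A sketch setting all three (the values are illustrative):

Code
decision_tree(mode="regression", tree_depth=10, min_n=20, cost_complexity=0.005) %>%
    set_engine("rpart")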

25.10 Ensemble models I bag_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/bag_tree.html for details.

25.10.1 rpart engine (default)

To use this engine, you need to load the baguette package.

Code
library(baguette)

bag_tree(mode="classification") %>%
    set_engine("rpart")
## Bagged Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 0
##   min_n = 2
## 
## Computational engine: rpart

The model has four tuning parameters (see the sketch after this list):

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0, as shown in the specification above): complexity parameter; higher values lead to simpler trees
  • class_cost (default NULL): cost of misclassifying each class; if NULL, the cost is set to 1 for all classes
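
A sketch setting the first three (the values are illustrative):

Code
bag_tree(mode="classification", tree_depth=10, min_n=5, cost_complexity=0.001) %>%
    set_engine("rpart")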

25.11 Ensemble models II boost_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/boost_tree.html for details.

25.11.1 xgboost engine (default)

To use this engine, you need to have the xgboost package installed.

Code
boost_tree(mode="classification") %>%
    set_engine("xgboost")
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

The model has eight tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

Code
boost_tree(mode="classification", trees=100, learn_rate=0.1, tree_depth=3) %>%
    set_engine("xgboost")
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   trees = 100
##   tree_depth = 3
##   learn_rate = 0.1
## 
## Computational engine: xgboost

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html.

25.11.2 lightgbm engine

To use this engine, you need to have the lightgbm and the bonsai packages installed.

Code
library(bonsai)
boost_tree(mode="regression") %>%
    set_engine("lightgbm")
## Boosted Tree Model Specification (regression)
## 
## Computational engine: lightgbm

This model has six tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

Code
boost_tree(mode="regression", trees=100, learn_rate=0.1, tree_depth=3) %>%
    set_engine("lightgbm")
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   trees = 100
##   tree_depth = 3
##   learn_rate = 0.1
## 
## Computational engine: lightgbm

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_lightgbm.html.

25.12 Ensemble models III rand_forest (classification and regression)

See https://parsnip.tidymodels.org/reference/rand_forest.html for details.

25.12.1 ranger engine (default)

To use this engine, you need to have the ranger package installed.

Code
rand_forest(mode="classification") %>%
    set_engine("ranger")
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger

The model has three tuning parameters. Here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

Code
rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger")
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Computational engine: ranger

Default values are:

  • mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
  • min_n: minimum node size; default is 5 for regression and 10 for classification

If you want to extract information about variable importance from the model, you need to set importance="impurity" in the ranger engine:

Code
rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger", importance="impurity")
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger
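
A sketch of pulling the importance scores out of a fitted model (iris is used as an assumed example dataset):

Code
fitted <- rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger", importance="impurity") %>%
    fit(Species ~ ., data=iris)

# extract the underlying ranger object and read its importance scores
fitted %>%
    extract_fit_engine() %>%
    ranger::importance()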

25.12.2 randomForest engine

To use this engine, you need to have the randomForest package installed.

Code
rand_forest(mode="regression") %>%
    set_engine("randomForest")
## Random Forest Model Specification (regression)
## 
## Computational engine: randomForest

The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

Code
rand_forest(mode="regression", trees=100, min_n=5, mtry=3) %>%
    set_engine("randomForest")
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Computational engine: randomForest

25.13 Support vector machines I svm_linear (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_linear.html for details.

For SVM models, it is recommended to normalize the predictors to a mean of zero and a variance of one, for example with a recipe as sketched below.
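
A sketch of one way to do this, pairing a normalization recipe with a linear SVM in a workflow (assumes the recipes and workflows packages; mtcars is an assumed example):

Code
library(recipes)
library(workflows)

rec <- recipe(mpg ~ ., data=mtcars) %>%
    step_normalize(all_numeric_predictors())   # center and scale all numeric predictors

workflow() %>%
    add_recipe(rec) %>%
    add_model(svm_linear(mode="regression"))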

25.13.1 LiblineaR engine (default)

To use this engine, you need to have the LiblineaR package installed.

Code
svm_linear(mode="classification") %>%
    set_engine("LiblineaR")
## Linear Support Vector Machine Model Specification (classification)
## 
## Computational engine: LiblineaR

The model has two tuning parameters, cost and margin (the latter is only used by regression models). Here is an example with a cost of 0.1 and a margin of 1:

Code
svm_linear(mode="classification", cost=0.1, margin=1) %>%
    set_engine("LiblineaR")
## Linear Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   margin = 1
## 
## Computational engine: LiblineaR

More details: https://parsnip.tidymodels.org/reference/details_svm_linear_LiblineaR.html

25.13.2 kernlab engine

To use this engine, you need to have the kernlab package installed.

Code
svm_linear(mode="regression") %>%
    set_engine("kernlab")
## Linear Support Vector Machine Model Specification (regression)
## 
## Computational engine: kernlab

The model has two tuning parameters. Here is an example with a cost of 0.1 and a margin of 1:

Code
svm_linear(mode="regression", cost=0.1, margin=1) %>%
    set_engine("kernlab")
## Linear Support Vector Machine Model Specification (regression)
## 
## Main Arguments:
##   cost = 0.1
##   margin = 1
## 
## Computational engine: kernlab

25.14 Support vector machines II svm_poly (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_poly.html for details.

25.14.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

Code
svm_poly(mode="classification") %>%
    set_engine("kernlab")
## Polynomial Support Vector Machine Model Specification (classification)
## 
## Computational engine: kernlab

The model has four tuning parameters. Here is an example with a cost of 0.1, a margin of 1, a scale_factor of 0.75, and a degree of 2:

Code
svm_poly(mode="classification", cost=0.1, margin=1, 
         scale_factor=0.75, degree=2) %>%
    set_engine("kernlab")
## Polynomial Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   degree = 2
##   scale_factor = 0.75
##   margin = 1
## 
## Computational engine: kernlab

25.15 Support vector machines III svm_rbf (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_rbf.html for details.

25.15.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

Code
svm_rbf(mode="classification") %>%
    set_engine("kernlab")
## Radial Basis Function Support Vector Machine Model Specification (classification)
## 
## Computational engine: kernlab

The model has three tuning parameters. Here is an example with a cost of 0.1, a margin of 0.1, and an rbf_sigma of 0.75:

Code
svm_rbf(mode="classification", cost=0.1, margin=0.1, 
        rbf_sigma=0.75) %>%
    set_engine("kernlab")
## Radial Basis Function Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   rbf_sigma = 0.75
##   margin = 0.1
## 
## Computational engine: kernlab