Appendix A — Models

As we’ve seen in Chapter 8 and Chapter 10, specifying a model requires first defining the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).

parsnip and its extension packages (e.g. discrim, baguette, bonsai, plsmod) provide model engines for classification, regression, and censored regression in a common format.

Use the command show_engines(...) to get an overview of all available engines for a given model type.

library(parsnip)
show_engines("linear_reg")
# A tibble: 8 × 2
  engine   mode               
  <chr>    <chr>              
1 lm       regression         
2 glm      regression         
3 glmnet   regression         
4 stan     regression         
5 spark    regression         
6 keras    regression         
7 brulee   regression         
8 quantreg quantile regression

Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest-neighbor model. These parameters can be set when defining the model; parsnip translates them into the engine-specific parameters. For example, the following code defines a k-nearest-neighbor model with 5 neighbors and a triangular distance-weighting function:

nearest_neighbor(mode = "regression", neighbors = 5,
  weight_func = "triangular") %>%
  set_engine("kknn") %>%
  translate()
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5
  weight_func = triangular

Computational engine: kknn 

Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
    ks = min_rows(5, data, 5), kernel = "triangular")

The translate() function shows how the engine is actually called. In this case, the neighbors parameter is mapped to ks = min_rows(5, data, 5) and weight_func is mapped to kernel = "triangular". This is useful when you want to understand more about an engine and read the documentation of the engine package.
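
Instead of fixed values, parameters can also be marked for tuning with the tune() placeholder, to be resolved later by a tuning function such as tune_grid(). A minimal sketch:

nearest_neighbor(mode = "regression", neighbors = tune()) %>%
  set_engine("kknn")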

In the following, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found at https://www.tidymodels.org/find/parsnip/.

A.1 Non-informative model null_model (regression and classification)

While not an actual model in the usual sense, a non-informative model provides a good baseline against which to compare other models. It always predicts the mean of the response variable for regression and the most frequent class for classification. See https://parsnip.tidymodels.org/reference/null_model.html for details.

null_model(mode = "regression") %>%
  set_engine("parsnip")
Null Model Specification (regression)

Computational engine: parsnip 
null_model(mode = "classification") %>%
  set_engine("parsnip")
Null Model Specification (classification)

Computational engine: parsnip 
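
As a minimal sketch using the built-in mtcars data, fitting and predicting with a null regression model looks like this; every prediction is simply the mean of the training outcome:

# the null model ignores all predictors
null_fit <- null_model(mode = "regression") %>%
  set_engine("parsnip") %>%
  fit(mpg ~ ., data = mtcars)

# every row receives the same prediction, mean(mtcars$mpg)
predict(null_fit, new_data = mtcars)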

A.2 Linear regression models linear_reg (regression)

See https://parsnip.tidymodels.org/reference/linear_reg.html for details.

A.2.1 lm engine (default)

linear_reg(mode = "regression") %>%
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 

No tunable parameters.

A.2.2 glm engine (generalized linear model)

The glm engine is a more flexible version of the lm engine. It allows you to specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data).

linear_reg(mode = "regression") %>%
  set_engine("glm")
Linear Regression Model Specification (regression)

Computational engine: glm 

No tunable parameters.
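
The family is passed as an engine argument. As a sketch, a Poisson regression with a log link (assuming your response is a count) can be specified as:

linear_reg(mode = "regression") %>%
  set_engine("glm", family = stats::poisson(link = "log"))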

Tip: Useful to know

When using the glm or glmnet engine, you might come across this warning:

Warning: glm.fit: fitted probabilities numerically 0 or 1
    occurred from the logistic model

In general, you can ignore it. It means that some fitted probabilities are numerically 0 or 1, i.e. the model is very certain about the predicted class for those observations.

A.2.3 glmnet engine (regularized linear regression)

linear_reg(mode = "regression") %>%
  set_engine("glmnet")
Linear Regression Model Specification (regression)

Computational engine: glmnet 

glmnet supports L1 and L2 regularization. The mixture parameter controls the balance between the two: mixture = 1 is pure L1 (lasso) and mixture = 0 is pure L2 (ridge). Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:

linear_reg(mode = "regression", penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.01
  mixture = 0.5

Computational engine: glmnet 

A.3 Partial least squares regression pls (regression)

See https://parsnip.tidymodels.org/reference/pls.html for details.

A.3.1 mixOmics engine (default)

This engine requires installation of the mixOmics package. See http://mixomics.org/ for details. Use the following to install the required packages:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager", repos = "http://cran.us.r-project.org")
if (!require("plsmod", quietly = TRUE))
  install.packages("plsmod", repos = "http://cran.us.r-project.org")

BiocManager::install("mixOmics")

Once the package is installed, you can use the mixOmics engine:

library(plsmod)

pls(mode = "regression") %>%
  set_engine("mixOmics")
PLS Model Specification (regression)

Computational engine: mixOmics 

The engine has two tunable parameters: num_comp (the number of PLS components) and predictor_prop (the maximum proportion of predictors with non-zero loadings per component).
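
For example, a sketch with three components and no sparsity:

pls(mode = "regression", num_comp = 3, predictor_prop = 1) %>%
  set_engine("mixOmics")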

A.4 Logistic regression models logistic_reg (classification)

See https://parsnip.tidymodels.org/reference/logistic_reg.html for details.

A.4.1 glm engine (default)

logistic_reg(mode = "classification") %>%
  set_engine("glm")
Logistic Regression Model Specification (classification)

Computational engine: glm 

See the comments above on the glm engine for linear regression models (Section A.2.2).

A.4.2 glmnet engine (regularized logistic regression)

logistic_reg(mode = "classification") %>%
  set_engine("glmnet")
Logistic Regression Model Specification (classification)

Computational engine: glmnet 

See the comments above on the glmnet engine for linear regression models (Section A.2.3).
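
As with linear regression, penalty and mixture can be set directly. For example:

logistic_reg(mode = "classification", penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")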

A.5 Nearest Neighbor models (classification and regression)

Nearest neighbor models can be used for both classification and regression, so you must specify the mode of the model, i.e. mode = "classification" or mode = "regression". See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.

A.5.1 kknn engine (default)

kknn is currently the only supported engine. It supports both classification and regression. See https://parsnip.tidymodels.org/reference/kknn.html for details.

Use mode to specify either a classification or regression model:

nearest_neighbor(mode = "classification") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 
nearest_neighbor(mode = "regression") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (regression)

Computational engine: kknn 

The engine has several tunable parameters. Here is an example of a k-nearest-neighbor model with 5 neighbors and a triangular distance-weighting function:

nearest_neighbor(mode = "regression", neighbors = 5,
  weight_func = "triangular") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5
  weight_func = triangular

Computational engine: kknn 

A triangular weight function gives more weight to neighbors that are closer to the observation. There are other options; e.g. rectangular weights all neighbors equally (standard unweighted k-NN).

See https://rdrr.io/cran/kknn/man/train.kknn.html for details about the kknn package.

If you cannot install the kknn package from CRAN, you can install it directly from GitHub:

install.packages("devtools")
devtools::install_github("KlausVigo/kknn")

A.6 Linear discriminant analysis discrim_linear (classification)

See https://parsnip.tidymodels.org/reference/discrim_linear.html for details.

A.6.1 MASS engine (default)

You will need to load the discrim package to use this engine.

library(discrim)

discrim_linear(mode = "classification") %>%
  set_engine("MASS")
Linear Discriminant Model Specification (classification)

Computational engine: MASS 

No tunable parameters.
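
A minimal sketch fitting LDA on the built-in iris data and producing posterior class probabilities:

lda_fit <- discrim_linear(mode = "classification") %>%
  set_engine("MASS") %>%
  fit(Species ~ ., data = iris)

# posterior class probabilities for the first six rows
predict(lda_fit, new_data = head(iris), type = "prob")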

A.7 Quadratic discriminant analysis discrim_quad (classification)

See https://parsnip.tidymodels.org/reference/discrim_quad.html for details.

A.7.1 MASS engine (default)

You will need to load the discrim package to use this engine.

library(discrim)

discrim_quad(mode = "classification") %>%
  set_engine("MASS")
Quadratic Discriminant Model Specification (classification)

Computational engine: MASS 

No tunable parameters.

A.8 Generalized additive models gen_additive_mod (regression and classification)

See https://parsnip.tidymodels.org/reference/gen_additive_mod.html for details.

A.8.1 mgcv engine (default)

You will need to load the mgcv package to use this engine.

library(mgcv)
Loading required package: nlme
This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
gen_additive_mod(mode = "regression") %>%
  set_engine("mgcv")
GAM Model Specification (regression)

Computational engine: mgcv 

The model has two tuning parameters:

  • select_features (default FALSE): if TRUE, an additional penalty is added so that individual model terms can be shrunk to zero (effectively removing them).
  • adjust_deg_free (default 1): level of penalization; higher values lead to more penalization.
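
For example, a sketch enabling feature selection with slightly increased penalization:

gen_additive_mod(mode = "regression", select_features = TRUE,
  adjust_deg_free = 1.5) %>%
  set_engine("mgcv")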

See Chapter 22 for more details.

A.9 Decision tree models decision_tree (classification, regression, and censored regression)

See https://parsnip.tidymodels.org/reference/decision_tree.html for details.

A.9.1 rpart engine (default)

decision_tree(mode = "regression") %>%
  set_engine("rpart")
Decision Tree Model Specification (regression)

Computational engine: rpart 

The model has three tuning parameters:

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0.01): complexity parameter; higher values lead to simpler trees
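
For example, a sketch of a shallower, more strongly pruned tree:

decision_tree(mode = "regression", cost_complexity = 0.02,
  tree_depth = 10, min_n = 5) %>%
  set_engine("rpart")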

A.9.2 partykit engine

library(bonsai)  # the partykit engine is implemented in the bonsai extension package

decision_tree(mode = "classification") %>%
  set_engine("partykit")
Decision Tree Model Specification (classification)

Computational engine: partykit 

The model has three tuning parameters:

  • tree_depth: maximum depth of the tree, by default no restriction
  • min_n (default 20): minimum number of observations in a node
  • mtry: random number of predictors to try at each split, by default no restriction

The partykit engine requires installation of the partykit and bonsai packages.
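
For example, a sketch restricting the tree depth:

decision_tree(mode = "classification", tree_depth = 5, min_n = 20) %>%
  set_engine("partykit")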

A.10 Ensemble models I bag_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/bag_tree.html for details.

A.10.1 rpart engine (default)

To use this engine, you need to load the baguette package.

library(baguette)

bag_tree(mode = "classification") %>%
  set_engine("rpart")
Bagged Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 0
  min_n = 2

Computational engine: rpart 

The model has four tuning parameters:

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0): complexity parameter; higher values lead to simpler trees
  • class_cost (default NULL): cost of misclassifying each class; if NULL, the cost is set to 1 for all classes
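
For example, a sketch with some pruning and 25 bagged trees; times is an engine-specific argument of the baguette rpart engine:

bag_tree(mode = "classification", cost_complexity = 0.01, min_n = 10) %>%
  set_engine("rpart", times = 25)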

A.11 Ensemble models II boost_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/boost_tree.html for details.

A.11.1 xgboost engine (default)

To use this engine, you need to have the xgboost package installed.

boost_tree(mode = "classification") %>%
  set_engine("xgboost")
Boosted Tree Model Specification (classification)

Computational engine: xgboost 

The model has eight tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

boost_tree(mode = "classification", trees = 100, learn_rate = 0.1,
  tree_depth = 3) %>%
  set_engine("xgboost")
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 100
  tree_depth = 3
  learn_rate = 0.1

Computational engine: xgboost 

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html.

A.11.2 lightgbm engine

To use this engine, you need to have the lightgbm and bonsai packages installed.

library(bonsai)
boost_tree(mode = "regression") %>%
  set_engine("lightgbm")
Boosted Tree Model Specification (regression)

Computational engine: lightgbm 

This model has six tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

boost_tree(mode = "regression", trees = 100, learn_rate = 0.1,
  tree_depth = 3) %>%
  set_engine("lightgbm")
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 100
  tree_depth = 3
  learn_rate = 0.1

Computational engine: lightgbm 

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_lightgbm.html.

A.12 Ensemble models III rand_forest (classification and regression)

See https://parsnip.tidymodels.org/reference/rand_forest.html for details.

A.12.1 ranger engine (default)

To use this engine, you need to have the ranger package installed.

rand_forest(mode = "classification") %>%
  set_engine("ranger")
Random Forest Model Specification (classification)

Computational engine: ranger 

The model has three tuning parameters. Here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

rand_forest(mode = "classification", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("ranger")
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Computational engine: ranger 

Default values are:

  • mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
  • min_n: minimum node size; default is 5 for regression and 10 for classification

If you want to extract information about variable importance from the model, you need to set importance = "impurity" in the ranger engine:

rand_forest(mode = "classification", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("ranger", importance = "impurity")
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
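
A minimal sketch on the built-in iris data, fitting the model and then pulling the importance scores from the underlying ranger object via extract_fit_engine():

rf_fit <- rand_forest(mode = "classification", trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(Species ~ ., data = iris)

# extract_fit_engine() returns the raw ranger fit
rf_fit %>% extract_fit_engine() %>% ranger::importance()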

A.12.2 randomForest engine

To use this engine, you need to have the randomForest package installed.

rand_forest(mode = "regression") %>%
  set_engine("randomForest")
Random Forest Model Specification (regression)

Computational engine: randomForest 

The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

rand_forest(mode = "regression", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("randomForest")
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Computational engine: randomForest 

A.13 Support vector machines I svm_linear (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_linear.html for details.

For SVM models, it is recommended to normalize the predictors to a mean of zero and a variance of one; a sketch of a suitable preprocessing setup follows.
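
A minimal sketch, assuming a data frame dat with a factor outcome y and numeric predictors, that combines a normalization recipe with a linear SVM in a workflow:

library(tidymodels)

# `dat` and `y` are hypothetical placeholders for your data
rec <- recipe(y ~ ., data = dat) %>%
  step_normalize(all_numeric_predictors())

svm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(svm_linear(mode = "classification") %>% set_engine("LiblineaR"))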

A.13.1 LiblineaR engine (default)

To use this engine, you need to have the LiblineaR package installed.

svm_linear(mode = "classification") %>%
  set_engine("LiblineaR")
Linear Support Vector Machine Model Specification (classification)

Computational engine: LiblineaR 

The model has two tuning parameters. Here is an example of a model with a cost of 0.1 and a margin of 1:

svm_linear(mode = "classification", cost = 0.1, margin = 1) %>%
  set_engine("LiblineaR")
Linear Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  margin = 1

Computational engine: LiblineaR 

More details: https://parsnip.tidymodels.org/reference/details_svm_linear_LiblineaR.html

A.13.2 kernlab engine

To use this engine, you need to have the kernlab package installed.

svm_linear(mode = "regression") %>%
  set_engine("kernlab")
Linear Support Vector Machine Model Specification (regression)

Computational engine: kernlab 

The model has two tuning parameters. Here is an example of a model with a cost of 0.1 and a margin of 1:

svm_linear(mode = "regression", cost = 0.1, margin = 1) %>%
  set_engine("kernlab")
Linear Support Vector Machine Model Specification (regression)

Main Arguments:
  cost = 0.1
  margin = 1

Computational engine: kernlab 

The margin parameter (the insensitivity zone of the loss function) is used only in regression models; it is ignored for classification.

A.14 Support vector machines II svm_poly (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_poly.html for details.

A.14.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

svm_poly(mode = "classification") %>%
  set_engine("kernlab")
Polynomial Support Vector Machine Model Specification (classification)

Computational engine: kernlab 

The model has four tuning parameters. Here is an example of a model with a cost of 0.1, a scale_factor of 0.75, and a degree of 2:

svm_poly(mode = "classification", cost = 0.1, scale_factor = 0.75,
  degree = 2) %>%
  set_engine("kernlab")
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  degree = 2
  scale_factor = 0.75

Computational engine: kernlab 

For regression models, you can also tune the margin parameter.

A.15 Support vector machines III svm_rbf (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_rbf.html for details.

A.15.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

svm_rbf(mode = "classification") %>%
  set_engine("kernlab")
Radial Basis Function Support Vector Machine Model Specification (classification)

Computational engine: kernlab 

The model has three tuning parameters. Here is an example of a model with a cost of 0.1 and an rbf_sigma of 0.75:

svm_rbf(mode = "classification", cost = 0.1, rbf_sigma = 0.75) %>%
  set_engine("kernlab")
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  rbf_sigma = 0.75

Computational engine: kernlab 

For regression models, you can also tune the margin parameter.
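
For example, a sketch of a regression model that also sets the margin:

svm_rbf(mode = "regression", cost = 0.1, rbf_sigma = 0.75, margin = 0.2) %>%
  set_engine("kernlab")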