Appendix A — Models

As we’ve seen in Chapter 8 and Chapter 10, specifying a model requires first defining the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).

parsnip and its extension packages (e.g. discrim, baguette, bonsai, plsmod) provide model engines for classification, regression, and censored regression in a common format.

Use the command show_engines(...) to get an overview of all available engines for a given model type.

library(parsnip)
show_engines("linear_reg")
# A tibble: 8 × 2
  engine   mode               
  <chr>    <chr>              
1 lm       regression         
2 glm      regression         
3 glmnet   regression         
4 stan     regression         
5 spark    regression         
6 keras    regression         
7 brulee   regression         
8 quantreg quantile regression

Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest-neighbor model. These parameters can be set when defining the model; parsnip translates them into the engine-specific parameters. For example, the following code defines a k-nearest-neighbor model with 5 neighbors and a triangular distance-weighting function:

nearest_neighbor(mode = "regression", neighbors = 5,
  weight_func = "triangular") %>%
  set_engine("kknn") %>%
  translate()
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5
  weight_func = triangular

Computational engine: kknn 

Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
    ks = min_rows(5, data, 5), kernel = "triangular")

The translate() function shows how the engine is actually called. In this case, the neighbors parameter is mapped to ks = min_rows(5, data, 5) and weight_func is mapped to kernel = "triangular". This is useful when you want to understand more about an engine and read the documentation of the engine package.
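
Instead of fixed values, parameters can also be marked for tuning with the tune() placeholder, to be resolved later by a tuning function such as tune_grid(). A minimal sketch:

nearest_neighbor(mode = "regression", neighbors = tune()) %>%
  set_engine("kknn")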

In the following, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found at https://www.tidymodels.org/find/parsnip/.

A.1 Non-informative model null_model (regression and classification)

While not an actual model in the usual sense, a non-informative model provides a good baseline against which to compare other models. It always predicts the mean of the response variable for regression and the most frequent class for classification. See https://parsnip.tidymodels.org/reference/null_model.html for details.

null_model(mode = "regression") %>%
  set_engine("parsnip")
Null Model Specification (regression)

Computational engine: parsnip 
null_model(mode = "classification") %>%
  set_engine("parsnip")
Null Model Specification (classification)

Computational engine: parsnip 
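
As a minimal sketch using the built-in mtcars data, fitting and predicting with a null regression model looks like this; every prediction is simply the mean of the training outcome:

# the null model ignores all predictors
null_fit <- null_model(mode = "regression") %>%
  set_engine("parsnip") %>%
  fit(mpg ~ ., data = mtcars)

# every row receives the same prediction, mean(mtcars$mpg)
predict(null_fit, new_data = mtcars)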

A.2 Linear regression models linear_reg (regression)

See https://parsnip.tidymodels.org/reference/linear_reg.html for details.

A.2.1 lm engine (default)

linear_reg(mode = "regression") %>%
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 

No tunable parameters.

A.2.2 glm engine (generalized linear model)

The glm engine is a more flexible version of the lm engine. It allows you to specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data).

linear_reg(mode = "regression") %>%
  set_engine("glm")
Linear Regression Model Specification (regression)

Computational engine: glm 

No tunable parameters.
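
The family is passed as an engine argument. As a sketch, a Poisson regression with a log link (assuming your response is a count) can be specified as:

linear_reg(mode = "regression") %>%
  set_engine("glm", family = stats::poisson(link = "log"))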

Tip: Useful to know

When using the glm or glmnet engine, you might come across this warning:

Warning: glm.fit: fitted probabilities numerically 0 or 1
    occurred from the logistic model

In general, you can ignore it. It means that some fitted probabilities are numerically 0 or 1, i.e. the model is very certain about the predicted class for those observations.

A.2.3 glmnet engine (regularized linear regression)

linear_reg(mode = "regression") %>%
  set_engine("glmnet")
Linear Regression Model Specification (regression)

Computational engine: glmnet 

glmnet supports L1 and L2 regularization. The mixture parameter controls the balance between the two: mixture = 1 is pure L1 (lasso) and mixture = 0 is pure L2 (ridge). Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:

linear_reg(mode = "regression", penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.01
  mixture = 0.5

Computational engine: glmnet 

A.3 Partial least squares regression pls (regression)

See https://parsnip.tidymodels.org/reference/pls.html for details.

A.3.1 mixOmics engine (default)

This engine requires installation of the mixOmics package. See http://mixomics.org/ for details. Use the following to install the required packages:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager", repos = "http://cran.us.r-project.org")
if (!require("plsmod", quietly = TRUE))
  install.packages("plsmod", repos = "http://cran.us.r-project.org")

BiocManager::install("mixOmics")

Once the package is installed, you can use the mixOmics engine:

library(plsmod)

pls(mode = "regression") %>%
  set_engine("mixOmics")
PLS Model Specification (regression)

Computational engine: mixOmics 

The engine has two tunable parameters: num_comp (the number of PLS components) and predictor_prop (the maximum proportion of predictors with non-zero loadings per component).
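
For example, a sketch with three components and no sparsity:

pls(mode = "regression", num_comp = 3, predictor_prop = 1) %>%
  set_engine("mixOmics")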

A.4 Logistic regression models logistic_reg (classification)

See https://parsnip.tidymodels.org/reference/logistic_reg.html for details.

A.4.1 glm engine (default)

logistic_reg(mode = "classification") %>%
  set_engine("glm")
Logistic Regression Model Specification (classification)

Computational engine: glm 

See the comments above on the glm engine for linear regression models (Section A.2.2).

A.4.2 glmnet engine (regularized logistic regression)

logistic_reg(mode = "classification") %>%
  set_engine("glmnet")
Logistic Regression Model Specification (classification)

Computational engine: glmnet 

See the comments above on the glmnet engine for linear regression models (Section A.2.3).
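
As with linear regression, penalty and mixture can be set directly. For example:

logistic_reg(mode = "classification", penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")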

A.5 Nearest Neighbor models (classification and regression)

Nearest neighbor models can be used for both classification and regression, so you must specify the mode of the model, i.e. mode = "classification" or mode = "regression". See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.

A.5.1 kknn engine (default)

kknn is currently the only supported engine. It supports both classification and regression. See https://parsnip.tidymodels.org/reference/kknn.html for details.

Use mode to specify either a classification or regression model:

nearest_neighbor(mode = "classification") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 
nearest_neighbor(mode = "regression") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (regression)

Computational engine: kknn 

The engine has several tunable parameters. Here is an example of a k-nearest-neighbor model with 5 neighbors and a triangular distance-weighting function:

nearest_neighbor(mode = "regression", neighbors = 5,
  weight_func = "triangular") %>%
  set_engine("kknn")
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 5
  weight_func = triangular

Computational engine: kknn 

A triangular weight function gives more weight to neighbors that are closer to the observation. There are other options; e.g. rectangular weights all neighbors equally (standard unweighted k-NN).

See https://rdrr.io/cran/kknn/man/train.kknn.html for details about the kknn package.

If you cannot install the kknn package from CRAN, you can install it directly from GitHub:

install.packages("devtools")
devtools::install_github("KlausVigo/kknn")

A.6 Linear discriminant analysis discrim_linear (classification)

See https://parsnip.tidymodels.org/reference/discrim_linear.html for details.

A.6.1 MASS engine (default)

You will need to load the discrim package to use this engine.

library(discrim)

discrim_linear(mode = "classification") %>%
  set_engine("MASS")
Linear Discriminant Model Specification (classification)

Computational engine: MASS 

No tunable parameters.
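
A minimal sketch fitting LDA on the built-in iris data and producing posterior class probabilities:

lda_fit <- discrim_linear(mode = "classification") %>%
  set_engine("MASS") %>%
  fit(Species ~ ., data = iris)

# posterior class probabilities for the first six rows
predict(lda_fit, new_data = head(iris), type = "prob")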

A.7 Quadratic discriminant analysis discrim_quad (classification)

See https://parsnip.tidymodels.org/reference/discrim_quad.html for details.

A.7.1 MASS engine (default)

You will need to load the discrim package to use this engine.

library(discrim)

discrim_quad(mode = "classification") %>%
  set_engine("MASS")
Quadratic Discriminant Model Specification (classification)

Computational engine: MASS 

No tunable parameters.

A.8 Generalized additive models gen_additive_mod (regression and classification)

See https://parsnip.tidymodels.org/reference/gen_additive_mod.html for details.

A.8.1 mgcv engine (default)

You will need to load the mgcv package to use this engine.

library(mgcv)
Loading required package: nlme
This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
gen_additive_mod(mode = "regression") %>%
  set_engine("mgcv")
GAM Model Specification (regression)

Computational engine: mgcv 

The model has two tuning parameters:

  • select_features (default FALSE): if TRUE, an additional penalty is added so that individual model terms can be shrunk to zero (effectively removing them).
  • adjust_deg_free (default 1): level of penalization; higher values lead to more penalization.
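
For example, a sketch enabling feature selection with slightly increased penalization:

gen_additive_mod(mode = "regression", select_features = TRUE,
  adjust_deg_free = 1.5) %>%
  set_engine("mgcv")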

See Chapter 22 for more details.

A.9 Decision tree models decision_tree (classification, regression, and censored regression)

See https://parsnip.tidymodels.org/reference/decision_tree.html for details.

A.9.1 rpart engine (default)

decision_tree(mode = "regression") %>%
  set_engine("rpart")
Decision Tree Model Specification (regression)

Computational engine: rpart 

The model has three tuning parameters:

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0.01): complexity parameter; higher values lead to simpler trees
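
For example, a sketch of a shallower, more strongly pruned tree:

decision_tree(mode = "regression", cost_complexity = 0.02,
  tree_depth = 10, min_n = 5) %>%
  set_engine("rpart")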

A.9.2 partykit engine

library(bonsai)  # the partykit engine is implemented in the bonsai extension package

decision_tree(mode = "classification") %>%
  set_engine("partykit")
Decision Tree Model Specification (classification)

Computational engine: partykit 

The model has three tuning parameters:

  • tree_depth: maximum depth of the tree, by default no restriction
  • min_n (default 20): minimum number of observations in a node
  • mtry: random number of predictors to try at each split, by default no restriction

The partykit engine requires installation of the partykit and bonsai packages.
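
For example, a sketch restricting the tree depth:

decision_tree(mode = "classification", tree_depth = 5, min_n = 20) %>%
  set_engine("partykit")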

A.10 Ensemble models I bag_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/bag_tree.html for details.

A.10.1 rpart engine (default)

To use this engine, you need to load the baguette package.

library(baguette)

bag_tree(mode = "classification") %>%
  set_engine("rpart")
Bagged Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 0
  min_n = 2

Computational engine: rpart 

The model has four tuning parameters:

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0): complexity parameter; higher values lead to simpler trees
  • class_cost (default NULL): cost of misclassifying each class; if NULL, the cost is set to 1 for all classes
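
For example, a sketch with some pruning and 25 bagged trees; times is an engine-specific argument of the baguette rpart engine:

bag_tree(mode = "classification", cost_complexity = 0.01, min_n = 10) %>%
  set_engine("rpart", times = 25)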

A.11 Ensemble models II boost_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/boost_tree.html for details.

A.11.1 xgboost engine (default)

To use this engine, you need to have the xgboost package installed.

boost_tree(mode = "classification") %>%
  set_engine("xgboost")
Boosted Tree Model Specification (classification)

Computational engine: xgboost 

The model has eight tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

boost_tree(mode = "classification", trees = 100, learn_rate = 0.1,
  tree_depth = 3) %>%
  set_engine("xgboost")
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 100
  tree_depth = 3
  learn_rate = 0.1

Computational engine: xgboost 

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html.

A.11.2 lightgbm engine

To use this engine, you need to have the lightgbm and bonsai packages installed.

library(bonsai)
boost_tree(mode = "regression") %>%
  set_engine("lightgbm")
Boosted Tree Model Specification (regression)

Computational engine: lightgbm 

This model has six tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

boost_tree(mode = "regression", trees = 100, learn_rate = 0.1,
  tree_depth = 3) %>%
  set_engine("lightgbm")
Boosted Tree Model Specification (regression)

Main Arguments:
  trees = 100
  tree_depth = 3
  learn_rate = 0.1

Computational engine: lightgbm 

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_lightgbm.html.

A.12 Ensemble models III rand_forest (classification and regression)

See https://parsnip.tidymodels.org/reference/rand_forest.html for details.

A.12.1 ranger engine (default)

To use this engine, you need to have the ranger package installed.

rand_forest(mode = "classification") %>%
  set_engine("ranger")
Random Forest Model Specification (classification)

Computational engine: ranger 

The model has three tuning parameters. Here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

rand_forest(mode = "classification", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("ranger")
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Computational engine: ranger 

Default values are:

  • mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
  • min_n: minimum node size; default is 5 for regression and 10 for classification

If you want to extract information about variable importance from the model, you need to set importance = "impurity" in the ranger engine:

rand_forest(mode = "classification", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("ranger", importance = "impurity")
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
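
A minimal sketch on the built-in iris data, fitting the model and then pulling the importance scores from the underlying ranger object via extract_fit_engine():

rf_fit <- rand_forest(mode = "classification", trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(Species ~ ., data = iris)

# extract_fit_engine() returns the raw ranger fit
rf_fit %>% extract_fit_engine() %>% ranger::importance()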

A.12.2 randomForest engine

To use this engine, you need to have the randomForest package installed.

rand_forest(mode = "regression") %>%
  set_engine("randomForest")
Random Forest Model Specification (regression)

Computational engine: randomForest 

The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

rand_forest(mode = "regression", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("randomForest")
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 3
  trees = 100
  min_n = 5

Computational engine: randomForest 

A.13 Support vector machines I svm_linear (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_linear.html for details.

For SVM models, it is recommended to normalize the predictors to a mean of zero and a variance of one; a sketch of a suitable preprocessing setup follows.
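
A minimal sketch, assuming a data frame dat with a factor outcome y and numeric predictors, that combines a normalization recipe with a linear SVM in a workflow:

library(tidymodels)

# `dat` and `y` are hypothetical placeholders for your data
rec <- recipe(y ~ ., data = dat) %>%
  step_normalize(all_numeric_predictors())

svm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(svm_linear(mode = "classification") %>% set_engine("LiblineaR"))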

A.13.1 LiblineaR engine (default)

To use this engine, you need to have the LiblineaR package installed.

svm_linear(mode = "classification") %>%
  set_engine("LiblineaR")
Linear Support Vector Machine Model Specification (classification)

Computational engine: LiblineaR 

The model has two tuning parameters. Here is an example of a model with a cost of 0.1 and a margin of 1:

svm_linear(mode = "classification", cost = 0.1, margin = 1) %>%
  set_engine("LiblineaR")
Linear Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  margin = 1

Computational engine: LiblineaR 

More details: https://parsnip.tidymodels.org/reference/details_svm_linear_LiblineaR.html

A.13.2 kernlab engine

To use this engine, you need to have the kernlab package installed.

svm_linear(mode = "regression") %>%
  set_engine("kernlab")
Linear Support Vector Machine Model Specification (regression)

Computational engine: kernlab 

The model has two tuning parameters. Here is an example of a model with a cost of 0.1 and a margin of 1:

svm_linear(mode = "regression", cost = 0.1, margin = 1) %>%
  set_engine("kernlab")
Linear Support Vector Machine Model Specification (regression)

Main Arguments:
  cost = 0.1
  margin = 1

Computational engine: kernlab 

The margin parameter (the insensitivity zone of the loss function) is used only in regression models; it is ignored for classification.

A.14 Support vector machines II svm_poly (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_poly.html for details.

A.14.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

svm_poly(mode = "classification") %>%
  set_engine("kernlab")
Polynomial Support Vector Machine Model Specification (classification)

Computational engine: kernlab 

The model has four tuning parameters. Here is an example of a model with a cost of 0.1, a scale_factor of 0.75, and a degree of 2:

svm_poly(mode = "classification", cost = 0.1, scale_factor = 0.75,
  degree = 2) %>%
  set_engine("kernlab")
Polynomial Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  degree = 2
  scale_factor = 0.75

Computational engine: kernlab 

For regression models, you can also tune the margin parameter.

A.15 Support vector machines III svm_rbf (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_rbf.html for details.

A.15.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

svm_rbf(mode = "classification") %>%
  set_engine("kernlab")
Radial Basis Function Support Vector Machine Model Specification (classification)

Computational engine: kernlab 

The model has three tuning parameters. Here is an example of a model with a cost of 0.1 and an rbf_sigma of 0.75:

svm_rbf(mode = "classification", cost = 0.1, rbf_sigma = 0.75) %>%
  set_engine("kernlab")
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = 0.1
  rbf_sigma = 0.75

Computational engine: kernlab 

For regression models, you can also tune the margin parameter.
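
For example, a sketch of a regression model that also sets the margin:

svm_rbf(mode = "regression", cost = 0.1, rbf_sigma = 0.75, margin = 0.2) %>%
  set_engine("kernlab")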