Chapter 25 Models

As we’ve seen in Chapters 8 and 10, defining a model requires first choosing the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).

A number of packages provide model engines for classification, regression, and censored regression that are compatible with parsnip.

Use the command show_engines(...) to get an overview of all available engines for a given model type.

Code
library(parsnip)
show_engines("linear_reg")
## # A tibble: 7 × 2
##   engine mode      
##   <chr>  <chr>     
## 1 lm     regression
## 2 glm    regression
## 3 glmnet regression
## 4 stan   regression
## 5 spark  regression
## 6 keras  regression
## 7 brulee regression

Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest neighbor model. These parameters can be set when defining the model; parsnip translates them into the engine-specific parameters. For example, the following code defines a k-nearest neighbor model with 5 neighbors and a triangular distance weighting function:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="triangular") %>%
    set_engine("kknn") %>%
    translate()
## K-Nearest Neighbor Model Specification (regression)
## 
## Main Arguments:
##   neighbors = 5
##   weight_func = triangular
## 
## Computational engine: kknn 
## 
## Model fit template:
## kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
##     ks = min_rows(5, data, 5), kernel = "triangular")

The translate() function shows how the underlying engine is actually called. In this case, the neighbors parameter is mapped to ks=min_rows(5, data, 5) and weight_func is mapped to kernel="triangular". This mapping is useful when you want to learn more about an engine and read the documentation of the engine package.

In the following, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found at https://www.tidymodels.org/find/parsnip/.

25.1 Non-informative model null_model (regression and classification)

While not a useful predictive model on its own, a non-informative model provides a good baseline against which to compare other models. A non-informative model always predicts the mean of the response variable for regression and the most frequent class for classification. See https://parsnip.tidymodels.org/reference/null_model.html for details.

Code
null_model(mode="regression") %>% 
    set_engine("parsnip")
## Null Model Specification (regression)
## 
## Computational engine: parsnip
Code
null_model(mode="classification") %>%
    set_engine("parsnip")
## Null Model Specification (classification)
## 
## Computational engine: parsnip
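
As a quick check of this baseline behavior, the following sketch (using the built-in mtcars data as an assumed example) fits a regression null model; every prediction is simply the training mean of mpg:

Code
null_model(mode="regression") %>%
    set_engine("parsnip") %>%
    fit(mpg ~ ., data=mtcars) %>%     # the predictors in the formula are ignored
    predict(new_data=head(mtcars))    # returns mean(mtcars$mpg) for every row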

25.2 Linear regression models linear_reg (regression)

See https://parsnip.tidymodels.org/reference/linear_reg.html for details.

25.2.1 lm engine (default)

Code
linear_reg(mode="regression") %>%
    set_engine("lm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

No tunable parameters.

25.2.2 glm engine (generalized linear model)

The glm engine is a more flexible version of the lm engine. It lets you specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data, etc.) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data, etc.).

Code
linear_reg(mode="regression") %>%
    set_engine("glm")
## Linear Regression Model Specification (regression)
## 
## Computational engine: glm

No tunable parameters.
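
The distribution and link are passed as engine arguments. As a sketch, a Poisson regression with a log link (assuming a count-valued response) could be specified like this:

Code
linear_reg(mode="regression") %>%
    set_engine("glm", family=stats::poisson(link="log"))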

Useful to know:

When using the glm or glmnet engine, you might come across this warning:

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

In general, you can ignore it. It means that the model is very certain about the predicted class of some observations.

25.2.3 glmnet engine (regularized linear regression)

Code
linear_reg(mode="regression") %>%
    set_engine("glmnet")
## Linear Regression Model Specification (regression)
## 
## Computational engine: glmnet

glmnet supports L1 and L2 regularization. Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:

Code
linear_reg(mode="regression", penalty=0.01, mixture=0.5) %>%
    set_engine("glmnet")
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.01
##   mixture = 0.5
## 
## Computational engine: glmnet
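
The mixture argument controls the blend of the two penalties: mixture=1 gives a pure L1 (lasso) model and mixture=0 a pure L2 (ridge) model. For example:

Code
# pure lasso (L1 only)
linear_reg(mode="regression", penalty=0.01, mixture=1) %>%
    set_engine("glmnet")

# pure ridge (L2 only)
linear_reg(mode="regression", penalty=0.01, mixture=0) %>%
    set_engine("glmnet")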

25.3 Partial least squares regression pls (regression)

See https://parsnip.tidymodels.org/reference/pls.html for details.

25.3.1 mixOmics engine (default)

This engine requires installation of the mixOmics package. See http://mixomics.org/ for details. Use the following to install the package:

Code
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "http://cran.us.r-project.org")
if (!require("plsmod", quietly = TRUE))
    install.packages("plsmod", repos = "http://cran.us.r-project.org")

BiocManager::install("mixOmics")
Code
library(plsmod)

pls(mode="regression") %>%
    set_engine("mixOmics") 
## PLS Model Specification (regression)
## 
## Computational engine: mixOmics

The engine has two tunable parameters: num_comp (the number of PLS components) and predictor_prop (the maximum proportion of predictors allowed to have non-zero loadings on each component).
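
A sketch setting both (the specific values are illustrative):

Code
pls(mode="regression", num_comp=3, predictor_prop=0.75) %>%
    set_engine("mixOmics")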

25.4 Logistic regression models logistic_reg (classification)

See https://parsnip.tidymodels.org/reference/logistic_reg.html for details.

25.4.1 glm engine (default)

Code
logistic_reg(mode="classification") %>%
    set_engine("glm")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

See comments above for glm engine in linear regression models (Section 25.2.2).

25.4.2 glmnet engine (regularized logistic regression)

Code
logistic_reg(mode="classification") %>%
    set_engine("glmnet")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glmnet

See comments above for the glmnet engine in linear regression models (Section 25.2.3).

25.5 Nearest neighbor models nearest_neighbor (classification and regression)

Nearest neighbor models can be used for classification and regression. It is therefore necessary to specify the mode of the model (i.e. mode="regression" or mode="classification"). See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.

25.5.1 kknn engine (default)

kknn is currently the only supported engine. It supports both classification and regression. See https://parsnip.tidymodels.org/reference/kknn.html for details.

Use mode to specify either a classification or regression model:

Code
nearest_neighbor(mode="classification") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (classification)
## 
## Computational engine: kknn
Code
nearest_neighbor(mode="regression") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (regression)
## 
## Computational engine: kknn

The engine has several tunable parameters. Here is an example of a k-nearest neighbor model with 5 neighbors and a triangular distance weighting function:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="triangular") %>%
    set_engine("kknn")
## K-Nearest Neighbor Model Specification (regression)
## 
## Main Arguments:
##   neighbors = 5
##   weight_func = triangular
## 
## Computational engine: kknn

A triangular weight function applies more weight to neighbors that are closer to the observation. There are other options; for example, rectangular weights all neighbors equally, as shown below.
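
A standard (unweighted) k-nearest neighbor model therefore uses the rectangular kernel:

Code
nearest_neighbor(mode="regression", neighbors=5, weight_func="rectangular") %>%
    set_engine("kknn")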

See https://rdrr.io/cran/kknn/man/train.kknn.html for details about the kknn package.

25.6 Linear discriminant analysis discrim_linear (classification)

See https://parsnip.tidymodels.org/reference/discrim_linear.html for details.

25.6.1 MASS engine (default)

You will need to load the discrim package to use this engine.

Code
library(discrim)

discrim_linear(mode="classification") %>%
    set_engine("MASS")
## Linear Discriminant Model Specification (classification)
## 
## Computational engine: MASS

No tunable parameters.
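
A minimal usage sketch, fitting to the built-in iris data (an assumed example dataset):

Code
discrim_linear(mode="classification") %>%
    set_engine("MASS") %>%
    fit(Species ~ ., data=iris) %>%
    predict(new_data=head(iris))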

25.7 Quadratic discriminant analysis discrim_quad (classification)

See https://parsnip.tidymodels.org/reference/discrim_quad.html for details.

25.7.1 MASS engine (default)

You will need to load the discrim package to use this engine.

Code
library(discrim)

discrim_quad(mode="classification") %>%
    set_engine("MASS")
## Quadratic Discriminant Model Specification (classification)
## 
## Computational engine: MASS

No tunable parameters.

25.8 Generalized additive models gen_additive_mod (regression and classification)

See https://parsnip.tidymodels.org/reference/gen_additive_mod.html for details.

25.8.1 mgcv engine (default)

You will need to have the mgcv package installed to use this engine.

Code
gen_additive_mod(mode="regression") %>%
    set_engine("mgcv")
## GAM Model Specification (regression)
## 
## Computational engine: mgcv

The model has two tuning parameters (see the sketch after this list):

  • select_features (default FALSE): if TRUE, the model will add a penalty term so that terms can be penalized to zero.
  • adjust_deg_free (default 1): level of penalization; higher values lead to more penalization.
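
A sketch with both parameters set (the values are illustrative). Note that with the mgcv engine, smooth terms are specified in the model formula at fit time using s(); mtcars is used here as an assumed example:

Code
gen_additive_mod(mode="regression", select_features=TRUE, adjust_deg_free=1.5) %>%
    set_engine("mgcv") %>%
    fit(mpg ~ s(disp) + wt, data=mtcars)   # s() marks the smooth term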

See the deep-dive chapter on gen_additive_mod for more details.

25.9 Decision tree models decision_tree (classification, regression, and censored regression)

See https://parsnip.tidymodels.org/reference/decision_tree.html for details.

25.9.1 rpart engine (default)

Code
decision_tree(mode="regression") %>%
    set_engine("rpart")
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

The model has three tuning parameters (see the sketch after this list):

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0.01): complexity parameter; higher values lead to simpler trees
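
A sketch setting all three (the values are illustrative):

Code
decision_tree(mode="regression", tree_depth=10, min_n=20, cost_complexity=0.005) %>%
    set_engine("rpart")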

25.10 Ensemble models I bag_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/bag_tree.html for details.

25.10.1 rpart engine (default)

To use this engine, you need to load the baguette package.

Code
library(baguette)

bag_tree(mode="classification") %>%
    set_engine("rpart")
## Bagged Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 0
##   min_n = 2
## 
## Computational engine: rpart

The model has four tuning parameters (see the sketch after this list):

  • tree_depth (default 30): maximum depth of the tree
  • min_n (default 2): minimum number of observations in a node
  • cost_complexity (default 0, as shown in the specification above): complexity parameter; higher values lead to simpler trees
  • class_cost (default NULL): cost of misclassifying each class; if NULL, the cost is set to 1 for all classes
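
A sketch setting the first three (the values are illustrative):

Code
bag_tree(mode="classification", tree_depth=10, min_n=5, cost_complexity=0.001) %>%
    set_engine("rpart")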

25.11 Ensemble models II boost_tree (classification and regression)

See https://parsnip.tidymodels.org/reference/boost_tree.html for details.

25.11.1 xgboost engine (default)

To use this engine, you need to have the xgboost package installed.

Code
boost_tree(mode="classification") %>%
    set_engine("xgboost")
## Boosted Tree Model Specification (classification)
## 
## Computational engine: xgboost

The model has eight tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

Code
boost_tree(mode="classification", trees=100, learn_rate=0.1, tree_depth=3) %>%
    set_engine("xgboost")
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   trees = 100
##   tree_depth = 3
##   learn_rate = 0.1
## 
## Computational engine: xgboost

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html.

25.11.2 lightgbm engine

To use this engine, you need to have the lightgbm and the bonsai packages installed.

Code
library(bonsai)
boost_tree(mode="regression") %>%
    set_engine("lightgbm")
## Boosted Tree Model Specification (regression)
## 
## Computational engine: lightgbm

This model has six tuning parameters. Here is an example of a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:

Code
boost_tree(mode="regression", trees=100, learn_rate=0.1, tree_depth=3) %>%
    set_engine("lightgbm")
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   trees = 100
##   tree_depth = 3
##   learn_rate = 0.1
## 
## Computational engine: lightgbm

For details see https://parsnip.tidymodels.org/reference/details_boost_tree_lightgbm.html.

25.12 Ensemble models III rand_forest (classification and regression)

See https://parsnip.tidymodels.org/reference/rand_forest.html for details.

25.12.1 ranger engine (default)

To use this engine, you need to have the ranger package installed.

Code
rand_forest(mode="classification") %>%
    set_engine("ranger")
## Random Forest Model Specification (classification)
## 
## Computational engine: ranger

The model has three tuning parameters. Here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

Code
rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger")
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Computational engine: ranger

Default values are:

  • mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
  • min_n: minimum node size; default is 5 for regression and 10 for classification

If you want to extract information about variable importance from the model, you need to set importance="impurity" in the ranger engine:

Code
rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger", importance="impurity")
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger
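
A sketch of pulling the importance scores out of a fitted model (iris is used as an assumed example dataset):

Code
fitted <- rand_forest(mode="classification", trees=100, min_n=5, mtry=3) %>%
    set_engine("ranger", importance="impurity") %>%
    fit(Species ~ ., data=iris)

# extract the underlying ranger object and read its importance scores
fitted %>%
    extract_fit_engine() %>%
    ranger::importance()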

25.12.2 randomForest engine

To use this engine, you need to have the randomForest package installed.

Code
rand_forest(mode="regression") %>%
    set_engine("randomForest")
## Random Forest Model Specification (regression)
## 
## Computational engine: randomForest

The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:

Code
rand_forest(mode="regression", trees=100, min_n=5, mtry=3) %>%
    set_engine("randomForest")
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 3
##   trees = 100
##   min_n = 5
## 
## Computational engine: randomForest

25.13 Support vector machines I svm_linear (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_linear.html for details.

For SVM models, it is recommended to normalize the predictors to a mean of zero and a variance of one, for example with a recipe as sketched below.
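
A sketch of one way to do this, pairing a normalization recipe with a linear SVM in a workflow (assumes the recipes and workflows packages; mtcars is an assumed example):

Code
library(recipes)
library(workflows)

rec <- recipe(mpg ~ ., data=mtcars) %>%
    step_normalize(all_numeric_predictors())   # center and scale all numeric predictors

workflow() %>%
    add_recipe(rec) %>%
    add_model(svm_linear(mode="regression"))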

25.13.1 LiblineaR engine (default)

To use this engine, you need to have the LiblineaR package installed.

Code
svm_linear(mode="classification") %>%
    set_engine("LiblineaR")
## Linear Support Vector Machine Model Specification (classification)
## 
## Computational engine: LiblineaR

The model has two tuning parameters, cost and margin (the latter is only used by regression models). Here is an example with a cost of 0.1 and a margin of 1:

Code
svm_linear(mode="classification", cost=0.1, margin=1) %>%
    set_engine("LiblineaR")
## Linear Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   margin = 1
## 
## Computational engine: LiblineaR

More details: https://parsnip.tidymodels.org/reference/details_svm_linear_LiblineaR.html

25.13.2 kernlab engine

To use this engine, you need to have the kernlab package installed.

Code
svm_linear(mode="regression") %>%
    set_engine("kernlab")
## Linear Support Vector Machine Model Specification (regression)
## 
## Computational engine: kernlab

The model has two tuning parameters. Here is an example with a cost of 0.1 and a margin of 1:

Code
svm_linear(mode="regression", cost=0.1, margin=1) %>%
    set_engine("kernlab")
## Linear Support Vector Machine Model Specification (regression)
## 
## Main Arguments:
##   cost = 0.1
##   margin = 1
## 
## Computational engine: kernlab

25.14 Support vector machines II svm_poly (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_poly.html for details.

25.14.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

Code
svm_poly(mode="classification") %>%
    set_engine("kernlab")
## Polynomial Support Vector Machine Model Specification (classification)
## 
## Computational engine: kernlab

The model has four tuning parameters. Here is an example with a cost of 0.1, a margin of 1, a scale_factor of 0.75, and a degree of 2:

Code
svm_poly(mode="classification", cost=0.1, margin=1, 
         scale_factor=0.75, degree=2) %>%
    set_engine("kernlab")
## Polynomial Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   degree = 2
##   scale_factor = 0.75
##   margin = 1
## 
## Computational engine: kernlab

25.15 Support vector machines III svm_rbf (classification and regression)

See https://parsnip.tidymodels.org/reference/svm_rbf.html for details.

25.15.1 kernlab engine (default)

To use this engine, you need to have the kernlab package installed.

Code
svm_rbf(mode="classification") %>%
    set_engine("kernlab")
## Radial Basis Function Support Vector Machine Model Specification (classification)
## 
## Computational engine: kernlab

The model has three tuning parameters. Here is an example with a cost of 0.1, a margin of 0.1, and an rbf_sigma of 0.75:

Code
svm_rbf(mode="classification", cost=0.1, margin=0.1, 
        rbf_sigma=0.75) %>%
    set_engine("kernlab")
## Radial Basis Function Support Vector Machine Model Specification (classification)
## 
## Main Arguments:
##   cost = 0.1
##   rbf_sigma = 0.75
##   margin = 0.1
## 
## Computational engine: kernlab