Chapter 28 Models
As we’ve seen in Chapters 8 and 10, specifying a model requires first defining the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).
The following packages provide model engines for classification, regression, and censored regression compatible with the parsnip format:
- parsnip: parsnip.tidymodels.org
- modeltime: business-science.github.io/modeltime/ for time series forecasting
Use the command show_engines(...) to get an overview of all available engines for a given model type.
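A minimal sketch of such a call, assuming the parsnip package is installed. The model type "linear_reg" is inferred from the output that follows; the set of engines listed depends on which parsnip extension packages are loaded.

```r
library(parsnip)

# List the engines registered for the linear_reg model type.
# `engines` is a hypothetical variable name used for illustration.
engines <- show_engines("linear_reg")
engines
```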
## # A tibble: 7 × 2
## engine mode
## <chr> <chr>
## 1 lm regression
## 2 glm regression
## 3 glmnet regression
## 4 stan regression
## 5 spark regression
## 6 keras regression
## 7 brulee regression
Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest neighbor model. These parameters can be set when defining the model. parsnip will translate them into the engine-specific parameters. For example, the following code defines a k-nearest neighbor model with 5 neighbors and a distance weighting function:
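A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name knn_spec is hypothetical.

```r
library(parsnip)

# Define a k-nearest neighbor regression model with 5 neighbors and
# triangular distance weighting, using the kknn engine.
knn_spec <- nearest_neighbor(
  mode = "regression",
  neighbors = 5,
  weight_func = "triangular"
) |>
  set_engine("kknn")

# Show how the parsnip arguments map to the engine's own arguments.
knn_spec |> translate()
```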
## K-Nearest Neighbor Model Specification (regression)
##
## Main Arguments:
## neighbors = 5
## weight_func = triangular
##
## Computational engine: kknn
##
## Model fit template:
## kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
## ks = min_rows(5, data, 5), kernel = "triangular")
The translate function returns information about how the actual engine is called. In this case, the neighbors parameter is mapped to ks=min_rows(5, data, 5) and weight_func is mapped to kernel="triangular". This information is useful when you want to understand more about the engine and read the documentation of the engine package.
In the following, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found here: https://www.tidymodels.org/find/parsnip/
28.1 Non-informative model null_model (regression and classification)
While not an actual model, training and evaluating a non-informative model is a good baseline to compare other models against. A non-informative model always predicts the mean of the response variable for regression models and the most frequent class for classification models. See https://parsnip.tidymodels.org/reference/null_model.html for details.
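A minimal sketch of both baseline specifications, assuming parsnip is installed. The mtcars data and all variable names are assumptions chosen for illustration.

```r
library(parsnip)

# Baseline specifications for each mode.
null_reg <- null_model(mode = "regression") |> set_engine("parsnip")
null_cls <- null_model(mode = "classification") |> set_engine("parsnip")

# Fit the regression baseline on a built-in data set (illustrative only).
null_fit <- null_reg |> fit(mpg ~ ., data = mtcars)

# Every prediction is the mean of the training response.
preds <- predict(null_fit, new_data = mtcars[1:3, ])
preds
```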
## Null Model Specification (regression)
##
## Computational engine: parsnip
## Null Model Specification (classification)
##
## Computational engine: parsnip
28.2 Linear regression models linear_reg (regression)
See https://parsnip.tidymodels.org/reference/linear_reg.html for details.
28.2.1 lm engine (default)
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
No tunable parameters.
28.2.2 glm engine (generalized linear model)
The glm engine is a more flexible version of the lm engine. It allows you to specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data, etc.) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data, etc.).
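As a sketch of how a family and link can be passed to the engine, assuming parsnip is installed: the family argument is forwarded to stats::glm() through set_engine(). The mtcars data and variable names are illustrative assumptions.

```r
library(parsnip)

# Gaussian family with identity link reproduces ordinary linear regression.
glm_spec <- linear_reg() |>
  set_engine("glm", family = stats::gaussian(link = "identity"))

# Fit on a built-in data set (illustrative only).
glm_fit <- glm_spec |> fit(mpg ~ wt + hp, data = mtcars)

# Coefficients of the underlying glm object.
coef(glm_fit$fit)
```

Swapping in a different family, e.g. stats::poisson(link = "log"), follows the same pattern.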
## Linear Regression Model Specification (regression)
##
## Computational engine: glm
No tunable parameters.
Useful to know:
When using the glm or glmnet engine for a logistic model, you might come across this warning:
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
In general, you can ignore it. It means that the model is very certain about the predicted class.
28.2.3 glmnet engine (regularized linear regression)
## Linear Regression Model Specification (regression)
##
## Computational engine: glmnet
glmnet supports L1 and L2 regularization. Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:
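A sketch of a call that should produce the specification printed below, assuming parsnip is installed. The variable name enet_spec is hypothetical.

```r
library(parsnip)

# penalty is the overall amount of regularization;
# mixture = 1 is pure L1 (lasso), mixture = 0 is pure L2 (ridge),
# and values in between give an elastic net.
enet_spec <- linear_reg(penalty = 0.01, mixture = 0.5) |>
  set_engine("glmnet")
enet_spec
```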
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 0.01
## mixture = 0.5
##
## Computational engine: glmnet
- See Chapter 21 for more details
- https://parsnip.tidymodels.org/reference/glmnet-details.html
28.3 Partial least squares regression pls (regression)
See https://parsnip.tidymodels.org/reference/pls.html for details.
28.3.1 mixOmics engine (default)
This engine requires installation of the mixOmics package.
See http://mixomics.org/ for details.
Use the following to install the package:
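# BiocManager is needed to install packages from Bioconductor
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager", repos = "http://cran.us.r-project.org")
# plsmod provides the parsnip bindings for PLS models
if (!require("plsmod", quietly = TRUE))
  install.packages("plsmod", repos = "http://cran.us.r-project.org")
# mixOmics is distributed through Bioconductor, not CRAN
BiocManager::install("mixOmics")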
## PLS Model Specification (regression)
##
## Computational engine: mixOmics
The engine has two tunable parameters, num_comp and predictor_prop.
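A sketch of a specification that sets both parameters, assuming the parsnip and plsmod packages are installed. The values 2 and 0.5 are arbitrary and the variable name pls_spec is hypothetical.

```r
library(parsnip)
library(plsmod)  # registers the pls model type with parsnip

# num_comp: number of PLS components to retain
# predictor_prop: maximum proportion of predictors allowed to have
#   non-zero coefficients in each component
pls_spec <- pls(num_comp = 2, predictor_prop = 0.5) |>
  set_engine("mixOmics") |>
  set_mode("regression")
pls_spec
```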
28.4 Logistic regression models logistic_reg (classification)
See https://parsnip.tidymodels.org/reference/logistic_reg.html for details.
28.4.1 glm engine (default)
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
See comments above for the glm engine in linear regression models (Section 28.2.2).
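A minimal sketch of fitting a logistic regression and predicting class probabilities, assuming parsnip is installed. The mtcars data, the recoded am outcome, and all variable names are assumptions made for illustration.

```r
library(parsnip)

# The outcome must be a factor for classification models.
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

logit_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt + hp, data = cars)

# type = "prob" returns one probability column per class.
probs <- predict(logit_fit, new_data = cars, type = "prob")
head(probs)
```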
28.4.2 glmnet engine (regularized logistic regression)
## Logistic Regression Model Specification (classification)
##
## Computational engine: glmnet
See comments above for the glmnet engine in linear regression models (Section 28.2.3).
28.5 Nearest Neighbor models (classification and regression)
Nearest neighbor models can be used for classification and regression. It is therefore necessary to specify the mode of the model (i.e. mode="regression" or mode="classification"). See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.
28.5.1 kknn engine (default)
kknn is currently the only supported engine. It supports both classification and regression. See https://parsnip.tidymodels.org/reference/kknn.html for details.
Use mode to specify either a classification or regression model:
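A sketch of the two equivalent ways to set the mode, assuming parsnip is installed. The variable names are hypothetical.

```r
library(parsnip)

# Option 1: pass the mode when creating the specification.
knn_cls <- nearest_neighbor(mode = "classification") |>
  set_engine("kknn")

# Option 2: set the mode afterwards with set_mode().
knn_reg <- nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("regression")

knn_cls
knn_reg
```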
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
## K-Nearest Neighbor Model Specification (regression)
##
## Computational engine: kknn
The engine has several tunable parameters. Here is an example with a k-nearest neighbor model with 5 neighbors and a distance weighting function:
## K-Nearest Neighbor Model Specification (regression)
##
## Main Arguments:
## neighbors = 5
## weight_func = triangular
##
## Computational engine: kknn
A triangular weight function applies more weight to neighbors that are closer to the observation. There are other options; rectangular weights all neighbors equally.
See https://rdrr.io/cran/kknn/man/train.kknn.html for details about the kknn package.
28.6 Linear discriminant analysis discrim_linear (classification)
See https://parsnip.tidymodels.org/reference/discrim_linear.html for details.
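A minimal sketch, assuming the parsnip and discrim packages are installed (discrim is the parsnip extension that provides discriminant analysis models). The iris data and variable names are illustrative assumptions.

```r
library(parsnip)
library(discrim)  # registers discrim_linear() and related models

# MASS is the default engine for linear discriminant analysis.
lda_spec <- discrim_linear() |>
  set_engine("MASS")

lda_fit <- lda_spec |> fit(Species ~ ., data = iris)

# Predicted classes for the training data.
lda_pred <- predict(lda_fit, new_data = iris)
head(lda_pred)
```

A quadratic discriminant analysis model can be specified the same way by replacing discrim_linear() with discrim_quad().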
28.7 Quadratic discriminant analysis discrim_quad (classification)
See https://parsnip.tidymodels.org/reference/discrim_quad.html for details.
28.8 Generalized additive models gen_additive_mod (regression and classification)
See https://parsnip.tidymodels.org/reference/gen_additive_mod.html for details.
28.8.1 mgcv engine (default)
You will need to load the mgcv package to use this engine.
## GAM Model Specification (regression)
##
## Computational engine: mgcv
The model has two tuning parameters:
- select_features (default FALSE): if TRUE, the model will add a penalty term so that terms can be penalized to zero.
- adjust_deg_free (default 1): level of penalization; higher values lead to more penalization.
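A sketch that sets both parameters and fits a small model, assuming parsnip and mgcv are installed. The parameter values, the mtcars data, and the smooth terms in the formula are assumptions chosen for illustration.

```r
library(parsnip)
library(mgcv)

gam_spec <- gen_additive_mod(
  select_features = TRUE,  # allow smooth terms to be penalized to zero
  adjust_deg_free = 1.5    # values above 1 give smoother fits
) |>
  set_engine("mgcv") |>
  set_mode("regression")

# Smooth terms are declared in the formula with s().
gam_fit <- gam_spec |> fit(mpg ~ s(wt) + s(hp), data = mtcars)
gam_fit
```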
See Chapter @ref(deep-dive-gen_additive_mod) for more details.
28.9 Decision tree models decision_tree (classification, regression, and censored regression)
See https://parsnip.tidymodels.org/reference/decision_tree.html for details.
28.9.1 rpart engine (default)
## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
The model has three tuning parameters:
- tree_depth (default 30): maximum depth of the tree
- min_n (default 2): minimum number of observations in a node
- cost_complexity (default 0.01): complexity parameter; higher values lead to simpler trees
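A sketch of a specification that sets all three parameters, assuming parsnip and rpart are installed. The parameter values, the mtcars data, and the variable names are arbitrary choices for illustration.

```r
library(parsnip)

tree_spec <- decision_tree(
  tree_depth = 5,          # limit the depth of the tree
  min_n = 10,              # require at least 10 observations to split a node
  cost_complexity = 0.001  # smaller values allow more complex trees
) |>
  set_engine("rpart") |>
  set_mode("regression")

tree_fit <- tree_spec |> fit(mpg ~ ., data = mtcars)
tree_fit
```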
28.10 Ensemble models I bag_tree (classification and regression)
See https://parsnip.tidymodels.org/reference/bag_tree.html for details.
28.10.1 rpart engine (default)
To use this engine, you need to load the baguette package.
## Bagged Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 0
## min_n = 2
##
## Computational engine: rpart
The model has four tuning parameters:
- tree_depth (default 30): maximum depth of the tree
- min_n (default 2): minimum number of observations in a node
- cost_complexity (cp in rpart; default 0.01): complexity parameter; higher values lead to simpler trees
- class_cost (default NULL): cost of misclassifying each class; if NULL, the cost is set to 1 for all classes
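A sketch of a bagged tree specification with two of these parameters set, assuming parsnip and baguette are installed. The values and the variable name bag_spec are hypothetical.

```r
library(parsnip)
library(baguette)  # registers bag_tree() with parsnip

bag_spec <- bag_tree(tree_depth = 5, min_n = 10) |>
  set_engine("rpart") |>
  set_mode("classification")
bag_spec
```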
28.11 Ensemble models II boost_tree (classification and regression)
See https://parsnip.tidymodels.org/reference/boost_tree.html for details.
28.11.1 xgboost engine (default)
To use this engine, you need to have the xgboost package installed.
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
The model has eight tuning parameters. Here is an example with a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:
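A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name xgb_spec is hypothetical.

```r
library(parsnip)

xgb_spec <- boost_tree(
  trees = 100,       # number of boosting iterations
  tree_depth = 3,    # maximum depth of each tree
  learn_rate = 0.1   # shrinkage applied to each boosting step
) |>
  set_engine("xgboost") |>
  set_mode("classification")
xgb_spec
```

Replacing the engine with "lightgbm" (after loading the bonsai package) gives the corresponding lightgbm specification.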
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## trees = 100
## tree_depth = 3
## learn_rate = 0.1
##
## Computational engine: xgboost
For details see https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html.
28.11.2 lightgbm engine
To use this engine, you need to have the lightgbm and the bonsai packages installed.
## Boosted Tree Model Specification (regression)
##
## Computational engine: lightgbm
This model has six tuning parameters. Here is an example with a model with 100 trees, a learning rate of 0.1, and a maximum tree depth of 3:
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## trees = 100
## tree_depth = 3
## learn_rate = 0.1
##
## Computational engine: lightgbm
For details see https://parsnip.tidymodels.org/reference/details_boost_tree_lightgbm.html.
28.12 Ensemble models III rand_forest (classification and regression)
See https://parsnip.tidymodels.org/reference/rand_forest.html for details.
28.12.1 ranger engine (default)
To use this engine, you need to have the ranger package installed.
## Random Forest Model Specification (classification)
##
## Computational engine: ranger
The model has three tuning parameters. Here is an example with a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:
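A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name rf_spec is hypothetical.

```r
library(parsnip)

rf_spec <- rand_forest(
  mtry = 3,     # predictors sampled at each split
  trees = 100,  # number of trees in the forest
  min_n = 5     # minimum node size
) |>
  set_engine("ranger") |>
  set_mode("classification")
rf_spec
```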
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 3
## trees = 100
## min_n = 5
##
## Computational engine: ranger
Default values are:
- mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
- min_n: minimum node size; default is 5 for regression and 10 for classification
If you want to extract information about variable importance from the model, you need to set importance="impurity" in the ranger engine:
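A sketch of setting the engine argument and reading the importance scores from the fitted ranger object, assuming parsnip and ranger are installed. The iris data, the seed, and the variable names are assumptions made for illustration.

```r
library(parsnip)

rf_imp_spec <- rand_forest(mtry = 3, trees = 100, min_n = 5) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

set.seed(123)  # random forests are stochastic
rf_imp_fit <- rf_imp_spec |> fit(Species ~ ., data = iris)

# The underlying ranger object stores one importance score per predictor.
var_imp <- extract_fit_engine(rf_imp_fit)$variable.importance
sort(var_imp, decreasing = TRUE)
```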
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 3
## trees = 100
## min_n = 5
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
28.12.2 randomForest engine
To use this engine, you need to have the randomForest package installed.
## Random Forest Model Specification (regression)
##
## Computational engine: randomForest
The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example with a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 3
## trees = 100
## min_n = 5
##
## Computational engine: randomForest
28.13 Support vector machines I svm_linear (classification and regression)
See https://parsnip.tidymodels.org/reference/svm_linear.html for details.
For SVM models, it is recommended to normalize the predictors to a mean of zero and a variance of one.
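One way to do this in tidymodels is a preprocessing recipe. The sketch below assumes the recipes package is installed; the iris data and variable names are illustrative and not taken from this chapter.

```r
library(recipes)

# step_normalize() centers each numeric predictor to mean 0 and
# scales it to standard deviation 1.
svm_recipe <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors())

# Estimate the means and standard deviations, then apply them
# to the training data.
normalized <- svm_recipe |>
  prep() |>
  bake(new_data = NULL)

summary(normalized$Sepal.Length)
```

In a full analysis, the recipe would typically be combined with the SVM specification in a workflow so that the same scaling is applied to new data at prediction time.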
28.13.1 LiblineaR engine (default)
To use this engine, you need to have the LiblineaR package installed.
## Linear Support Vector Machine Model Specification (classification)
##
## Computational engine: LiblineaR
The model has two tuning parameters. Here is an example with a model with a cost of 0.1 and a margin of 1:
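A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name svm_lin_spec is hypothetical.

```r
library(parsnip)

svm_lin_spec <- svm_linear(cost = 0.1, margin = 1) |>
  set_engine("LiblineaR") |>
  set_mode("classification")
svm_lin_spec
```

The same specification with set_engine("kernlab") selects the kernlab engine instead.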
## Linear Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = 0.1
## margin = 1
##
## Computational engine: LiblineaR
More details: https://parsnip.tidymodels.org/reference/details_svm_linear_LiblineaR.html
28.13.2 kernlab engine
To use this engine, you need to have the kernlab package installed.
## Linear Support Vector Machine Model Specification (regression)
##
## Computational engine: kernlab
The model has two tuning parameters. Here is an example with a model with a cost of 0.1 and a margin of 1:
## Linear Support Vector Machine Model Specification (regression)
##
## Main Arguments:
## cost = 0.1
## margin = 1
##
## Computational engine: kernlab
28.14 Support vector machines II svm_poly (classification and regression)
See https://parsnip.tidymodels.org/reference/svm_poly.html for details.
28.14.1 kernlab engine (default)
To use this engine, you need to have the kernlab package installed.
## Polynomial Support Vector Machine Model Specification (classification)
##
## Computational engine: kernlab
The model has four tuning parameters. Here is an example with a model with a cost of 0.1, a margin of 1, a scale_factor of 0.75, and a degree of 2:
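A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name svm_poly_spec is hypothetical.

```r
library(parsnip)

svm_poly_spec <- svm_poly(
  cost = 0.1,           # cost of violating the margin
  degree = 2,           # degree of the polynomial kernel
  scale_factor = 0.75,  # scaling applied inside the kernel
  margin = 1            # margin parameter, as in the specification below
) |>
  set_engine("kernlab") |>
  set_mode("classification")
svm_poly_spec
```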
## Polynomial Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = 0.1
## degree = 2
## scale_factor = 0.75
## margin = 1
##
## Computational engine: kernlab
28.15 Support vector machines III svm_rbf (classification and regression)
See https://parsnip.tidymodels.org/reference/svm_rbf.html for details.
28.15.1 kernlab engine (default)
To use this engine, you need to have the kernlab package installed.
## Radial Basis Function Support Vector Machine Model Specification (classification)
##
## Computational engine: kernlab
The model has three tuning parameters. Here is an example with a model with a cost of 0.1, a margin of 0.1, and an rbf_sigma of 0.75:
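A sketch of a call that should produce a specification like the one printed below, assuming parsnip is installed. The variable name svm_rbf_spec is hypothetical.

```r
library(parsnip)

svm_rbf_spec <- svm_rbf(
  cost = 0.1,        # cost of violating the margin
  rbf_sigma = 0.75,  # width parameter of the radial basis function kernel
  margin = 0.1       # margin parameter, as in the specification below
) |>
  set_engine("kernlab") |>
  set_mode("classification")
svm_rbf_spec
```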
## Radial Basis Function Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = 0.1
## rbf_sigma = 0.75
## margin = 0.1
##
## Computational engine: kernlab