As we’ve seen in Chapter 8 and Chapter 10, specifying a model requires first defining the model type (e.g. linear_reg) and then selecting a suitable engine (e.g. lm).
The following packages provide model engines for classification, regression, and censored regression compatible with the parsnip format:
Some model types have common tunable parameters, e.g. the number of nearest neighbors in a k-nearest neighbor model. These parameters can be set when defining the model, and parsnip translates them into the engine-specific parameters. For example, the following code defines a k-nearest neighbor model with 5 neighbors and a triangular distance weighting function (piping the specification through translate() also prints the model fit template shown in the output):
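nearest_neighbor(mode = "regression", neighbors = 5, weight_func = "triangular") %>%
  set_engine("kknn") %>%
  translate()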
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = 5
weight_func = triangular
Computational engine: kknn
Model fit template:
kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
ks = min_rows(5, data, 5), kernel = "triangular")
The translate() function shows how the underlying engine is actually called. In this case, the neighbors parameter is mapped to ks = min_rows(5, data, 5) and weight_func is mapped to kernel = "triangular". This information is useful when you want to learn more about an engine and read the documentation of the engine package.
In the following sections, we cover a selection of models and engines relevant for DS-6030. A full list of all available parsnip models can be found here: https://www.tidymodels.org/find/parsnip/
A.1 Non-informative model null_model (regression and classification)
While not an actual model, a non-informative model is a good baseline to train, evaluate, and compare other models against. It always predicts the mean of the response variable for regression models and the most frequent class for classification models. See https://parsnip.tidymodels.org/reference/null_model.html for details.
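A minimal specification might look like this (for a classification task; the engine is simply "parsnip"):
null_model(mode = "classification") %>%
  set_engine("parsnip")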
A.2 Linear models linear_reg (regression)
A.2.1 lm engine (ordinary least squares)
Linear Regression Model Specification (regression)
Computational engine: lm
No tunable parameters.
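This specification can be created with:
linear_reg() %>%
  set_engine("lm")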
A.2.2 glm engine (generalized linear model)
The glm engine is a more flexible version of the lm engine. It allows you to specify the distribution of the response variable (e.g. gaussian for linear regression, binomial for logistic regression, poisson for count data) and the link function (e.g. identity for linear regression, logit for logistic regression, log for count data).
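For example, a sketch of a Poisson regression for count data via the glm engine; the family argument is passed through to stats::glm():
linear_reg() %>%
  set_engine("glm", family = stats::poisson(link = "log"))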
A.2.3 glmnet engine (regularized regression)
Linear Regression Model Specification (regression)
Computational engine: glmnet
glmnet supports L1 and L2 regularization. Here is an example with a mixture of L1 and L2 regularization (elastic net) and a regularization parameter of 0.01:
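# mixture = 0.5 is an assumed even blend of L1 and L2 (1 = pure lasso, 0 = pure ridge)
linear_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")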
The glmnet engine is also available for logistic regression:
Logistic Regression Model Specification (classification)
Computational engine: glmnet
See the comments above on the glmnet engine for linear regression models (Section A.2.3).
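An analogous sketch for classification (with hypothetical penalty and mixture values):
logistic_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet")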
A.5 Nearest neighbor models nearest_neighbor (classification and regression)
Nearest neighbor models can be used for both classification and regression, so it is necessary to specify the mode of the model, i.e. mode = "regression" or mode = "classification". See https://parsnip.tidymodels.org/reference/nearest_neighbor.html for details.
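For example, the specification printed below can be created with:
nearest_neighbor(mode = "regression", neighbors = 5, weight_func = "triangular") %>%
  set_engine("kknn")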
K-Nearest Neighbor Model Specification (regression)
Main Arguments:
neighbors = 5
weight_func = triangular
Computational engine: kknn
A triangular weight function applies more weight to neighbors that are closer to the observation. There are other options; for example, rectangular weights all neighbors equally.
! parsnip could not locate an implementation for `decision_tree` classification model specifications using the `partykit` engine.
ℹ The parsnip extension package bonsai implements support for this specification.
ℹ Please install (if needed) and load to continue.
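Once bonsai is installed and loaded, the same specification succeeds:
library(bonsai)
decision_tree(mode = "classification") %>%
  set_engine("partykit")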
Decision Tree Model Specification (classification)
Computational engine: partykit
The model has three tuning parameters (see the example below):
tree_depth: maximum depth of the tree; by default no restriction
min_n: minimum number of observations in a node; default 20
mtry: number of randomly selected predictors to try at each split; by default no restriction
The partykit engine requires installation of the partykit and bonsai packages.
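For example, a sketch with hypothetical values for two of the tuning parameters:
decision_tree(mode = "classification", tree_depth = 4, min_n = 10) %>%
  set_engine("partykit")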
A.10 Ensemble models I bag_tree (classification and regression)
A.11 Ensemble models II rand_forest (classification and regression)
Random Forest Model Specification (classification)
Computational engine: ranger
The model has three tuning parameters. Here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:
rand_forest(mode = "classification", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("ranger")
Random Forest Model Specification (classification)
Main Arguments:
mtry = 3
trees = 100
min_n = 5
Computational engine: ranger
Default values are:
mtry: number of randomly selected predictors at each split; default is the square root of the number of predictors
min_n: minimum node size; default is 5 for regression and 10 for classification
If you want to extract variable importance information from the model, you need to set importance = "impurity" in the ranger engine:
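rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "impurity")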
The randomForest engine is an alternative:
Random Forest Model Specification (regression)
Computational engine: randomForest
The ranger package is considerably faster than randomForest, so we recommend using ranger instead. However, if you want to use randomForest, here is an example of a model with 100 trees, a minimum node size of 5, and 3 randomly selected predictors at each split:
rand_forest(mode = "regression", trees = 100, min_n = 5, mtry = 3) %>%
  set_engine("randomForest")
Random Forest Model Specification (regression)
Main Arguments:
mtry = 3
trees = 100
min_n = 5
Computational engine: randomForest
A.13 Support vector machines I svm_linear (classification and regression)
A.14 Support vector machines II svm_rbf (classification and regression)
Radial Basis Function Support Vector Machine Model Specification (classification)
Main Arguments:
cost = 0.1
rbf_sigma = 0.75
Computational engine: kernlab
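The specification above can be created with:
svm_rbf(mode = "classification", cost = 0.1, rbf_sigma = 0.75) %>%
  set_engine("kernlab")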
For regression models, you can also tune the margin parameter.
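A sketch of a regression specification with a hypothetical margin value:
svm_rbf(mode = "regression", cost = 0.1, rbf_sigma = 0.75, margin = 0.1) %>%
  set_engine("kernlab")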