Chapter 12 Sampling from a dataset
Sampling from a dataset is at the core of all validation and tuning methods. In most cases, the sampling is done using a random process. There are two ways of sampling from a dataset:
- Sampling without replacement: each observation can be sampled only once. We use this approach for creating a holdout set and for cross-validation.
- Sampling with replacement: the same observation can be sampled more than once. This approach to sampling is at the core of the bootstrap method. Both modes are illustrated in the short sketch after this list.
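To make the distinction concrete, here is a minimal sketch using base R's sample() function; the vector x and the sample size are arbitrary choices for illustration.
x <- 1:10
set.seed(1)
sample(x, 5)                  # without replacement: five distinct values
sample(x, 5, replace = TRUE)  # with replacement: values may repeat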
It can also be important to consider the structure of the dataset when sampling. By structure, we mean the distribution of the data with respect to the predictors or the outcome. Sampling that preserves this structure is called stratified sampling. With stratified sampling, you can make sure that the distribution of the data is preserved in the sample. For example, if you have a binary outcome, you can make sure that the sample contains the same proportion of positive and negative outcomes as the original dataset. For continuous outcomes, it would mean that the sample has the same distribution of the outcome as the original dataset.
You can see the effect of stratified sampling for a continuous variable in Figure 12.2. The two panels show the distribution of the continuous variable for a random sample (left) and a stratified sample (right). The original distribution is shown in red and the distribution of the samples in grey. We can clearly see that stratified sampling preserves the distribution of the data better.
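As a small illustration of the idea for a binary variable (not the rsample functions used later in this chapter), a stratified sample can be sketched with dplyr by sampling within each group; the mtcars data, the am column, and the 50% sample size are illustrative assumptions.
library(dplyr)
set.seed(1)
# simple random sample of half the rows
random_half <- slice_sample(mtcars, prop = 0.5)
# stratified sample: draw half the rows within each level of am
stratified_half <- mtcars |> group_by(am) |> slice_sample(prop = 0.5) |> ungroup()
# compare the proportion of am = 1 in the original data and in the two samples
prop.table(table(mtcars$am))
prop.table(table(random_half$am))
prop.table(table(stratified_half$am))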
Useful to know:
Sampling, like all calculations that depend on randomness, uses a random number generator which means that repeated execution of the same code will lead to different results. You can make your calculations reproducible by setting a seed using the set.seed() function. The seed can be any number. The same seed will always produce the same random numbers. In general, it is good practice to set a seed to make sure that the results do not change between runs. You will see that we use the set.seed() function in the examples below.
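For example, resetting the seed reproduces exactly the same draw; the seed value 42 is arbitrary.
set.seed(42)
sample(1:10, 3)   # three random numbers
set.seed(42)
sample(1:10, 3)   # the same three numbers again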
12.1 Sampling in statistical modeling
In statistical modeling, we use sampling to create various subsets of the data. Common scenarios are:
- Split the data randomly into training and holdout set; use the training set to fit the model and the holdout set to evaluate the model. This approach can be taken if your model doesn’t require tuning. Using the holdout set to evaluate the model is a way to get an unbiased estimate of the model performance. It is good practice to repeat this process several times to get a more stable estimate of the model performance. This is called repeated holdout (see Figure 12.3).
- Split the data randomly into training, validation, and holdout set; use the training set to fit various models, select a specific model using the performance on the validation set, and use the holdout set to evaluate the final model. This approach can be taken if you have sufficient data and training the model is costly. See Figure 12.4 for an illustration of this approach.
- The previous scenario relies on a single training/validation split for comparing the performance of different models. A more robust approach is to use cross-validation. In cross-validation, the data is split systematically into \(k\) folds. Each fold is used as a validation set and the model is trained on the remaining \(k-1\) folds. In total, we train \(k\) models. Each of the folds is used as a validation set once. The performance of the model is then averaged over the \(k\) folds. You repeat this \(k\)-fold cross-validation approach for each of the models you want to compare and pick the best model based on the estimated performance. Finally, the holdout set is used to make a decision on deploying the model or not. See Figure 12.5 for an illustration of this approach.
- A similar approach to cross-validation is using the bootstrap. Here the data are split randomly with replacement into a training and a validation set. The training set is used to train a model and the validation set to estimate the performance. Because of sampling with replacement, the training set will contain duplicates and the validation set will have a different size for each bootstrap sample. However, because we repeat this bootstrap splitting several times, each data point will eventually be used in training and in validation. It is up to you how many bootstrap samples you create. Once we have evaluated each bootstrap sample, the performance estimates are combined and used to compare the various models and pick the best model based on the estimated performance. It is best to use the same splits for each of the models. See Figure 12.6 for an illustration of this approach. A brief preview of the rsample functions for these resampling schemes follows this list.
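As a preview of how these resampling schemes look in the rsample package (the functions below are documented at https://rsample.tidymodels.org/; the mtcars data, the seed, and the numbers of repeats are arbitrary choices):
library(rsample)
set.seed(123)
mc_cv(mtcars, prop = 0.75, times = 10)   # repeated holdout (Monte Carlo cross-validation)
vfold_cv(mtcars, v = 5)                  # 5-fold cross-validation
bootstraps(mtcars, times = 25)           # 25 bootstrap resamples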
There are more variations of these approaches. For example, you can use nested cross-validation to tune the hyperparameters of a model. In this case, you use cross-validation to compare different models and then use cross-validation again to tune the hyperparameters of the best model. There are also specialized approaches for time series data. We will not cover these approaches in this class.
In this section, we will see how to implement these approaches using the rsample package, which is part of tidymodels.
Useful to know:
If you look at the literature, you will find that the terminology is not always consistent. For example, some authors use the term validation set to refer to the holdout set. In this class, we will use the term validation set to refer to the set that is used to compare different models and holdout set to refer to the set that is used to evaluate the final model. The rsample package uses its own terminology.
12.2 Creating an initial split of the data into training and holdout set
The first step in building a model is to split the data into a training and a holdout set. The objective of the holdout set is to get an unbiased estimate of the model performance for the selected best model. The holdout set is used only once at the end of the modeling process and not for any intermediate steps. rsample refers to the holdout set as the testing set.
We can use the function rsample::initial_split to create a single split of the data. Here is an example:
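The call below mirrors the code summarized at the end of this chapter; it splits the mtcars dataset using the default settings.
library(rsample)
set.seed(1353)
car_split <- initial_split(mtcars)
car_split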
## <Training/Testing/Total>
## <24/8/32>
The function initial_split by itself doesn’t create different subsets of the data; it only creates a blueprint for how the data should be split. Here, we see that the 32 data points are split into 24 data points for training and 8 for testing. By default, the function splits the data into 75% for training and 25% for testing.
To get the individual subsets, we need to use the functions training and testing.
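Again mirroring the chapter code summary (the dim() calls are only added here to confirm the sizes):
train_data <- training(car_split)
test_data <- testing(car_split)
dim(train_data)   # 24 rows
dim(test_data)    # 8 rows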
The function rsample::initial_split is used to create a single split of the data. It takes several arguments. The most commonly used ones are:
- prop: the proportion of the data that should be used for training. The default is 0.75.
- strata: a variable that is used to stratify the data. The default is NULL. It is good practice to use stratification.
- breaks: this argument is used for creating stratified samples from a continuous variable. It specifies the number of breaks that should be used to create the strata. The default is 4. A sketch using strata and breaks follows this list.
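For example, a stratified split on a continuous variable could look like the following sketch; the choice of mpg as the stratification variable, the 80/20 proportion, and the seed are illustrative assumptions.
set.seed(2024)
car_split_strat <- initial_split(mtcars, prop = 0.8, strata = mpg, breaks = 4)
car_split_strat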
12.3 Creating an initial split of the data into training, validation, and holdout set
If, for some reason, you cannot afford to use one of the iterative approaches (cross-validation or bootstrap), you can use a single split of the data into training, validation, and holdout set. The training set is used to fit the model, the validation set to compare different models, and the holdout set to evaluate the final model. The function rsample::initial_validation_split allows you to create a random split of your dataset.
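The call below mirrors the code summarized at the end of this chapter:
set.seed(9872)
car_split <- initial_validation_split(mtcars)
car_split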
## <Training/Validation/Testing/Total>
## <19/6/7/32>
We see that the 32 data points are split into 19 data points for training, 6 for validation, and 7 for testing/holdout. By default, the function splits the data into 60% for training, 20% for validation, and 20% for testing/holdout.
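If you want different proportions, initial_validation_split accepts a length-two prop argument giving the fractions for training and validation, with the remainder going to the holdout set; the 70/15/15 split below is an arbitrary example.
set.seed(9872)
car_split_custom <- initial_validation_split(mtcars, prop = c(0.70, 0.15))
car_split_custom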
To get the individual subsets, we need to use the functions training, validation, and testing.
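As before, mirroring the chapter code summary:
train_data <- training(car_split)
validation_data <- validation(car_split)
holdout_data <- testing(car_split)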
Further information:
- https://rsample.tidymodels.org/ is the documentation for the rsample package.
Code
The code of this chapter is summarized here.
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_validate.png")
knitr::include_graphics("images/stratified-continuous.png")
knitr::include_graphics("images/validation_repeated_holdout.png")
knitr::include_graphics("images/validation_train_validation_holdout_split.png")
knitr::include_graphics("images/validation_cross_validation.png")
knitr::include_graphics("images/validation_bootstrap.png")
library(tidymodels)
# or
library(rsample)
# single training/holdout split of mtcars (75% training / 25% testing by default)
set.seed(1353)
car_split <- initial_split(mtcars)
car_split
train_data <- training(car_split)
test_data <- testing(car_split)
# single training/validation/holdout split (60% / 20% / 20% by default)
set.seed(9872)
car_split <- initial_validation_split(mtcars)
car_split
train_data <- training(car_split)
validation_data <- validation(car_split)
holdout_data <- testing(car_split)