Chapter 12 Sampling from a dataset
Sampling from a dataset is at the core of all validation and tuning methods. In most cases, the sampling is done using a random process. There are two ways of sampling from a dataset:
- Sampling without replacement: each observation can be sampled only once. We use this approach for creating a holdout set and for cross-validation.
- Sampling with replacement: the same observation can be sampled more than once. This approach to sampling is at the core of the bootstrap method. Both modes are illustrated in the short sketch after this list.
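To make the distinction concrete, here is a minimal sketch using base R's sample() function; the vector x and the sample size are arbitrary choices for illustration.
x <- 1:10
set.seed(1)
sample(x, 5)                  # without replacement: five distinct values
sample(x, 5, replace = TRUE)  # with replacement: values may repeat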
It can also be important to consider the structure of the dataset when sampling. By structure, we mean the distribution of the data with respect to the predictors or the outcome. Sampling that preserves this structure is called stratified sampling. With stratified sampling, you can make sure that the distribution of the data is preserved in the sample. For example, if you have a binary outcome, you can make sure that the sample contains the same proportion of positive and negative outcomes as the original dataset. For continuous outcomes, it would mean that the sample has the same distribution of the outcome as the original dataset.
You can see the effect of stratified sampling for a continuous variable in Figure 12.2. The two panels show the distribution of the continuous variable for a random sample (left) and a stratified sample (right). The original distribution is shown in red and the distribution of the samples in grey. We can clearly see that stratified sampling preserves the distribution of the data better.
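As a small illustration of the idea for a binary variable (not the rsample functions used later in this chapter), a stratified sample can be sketched with dplyr by sampling within each group; the mtcars data, the am column, and the 50% sample size are illustrative assumptions.
library(dplyr)
set.seed(1)
# simple random sample of half the rows
random_half <- slice_sample(mtcars, prop = 0.5)
# stratified sample: draw half the rows within each level of am
stratified_half <- mtcars |> group_by(am) |> slice_sample(prop = 0.5) |> ungroup()
# compare the proportion of am = 1 in the original data and in the two samples
prop.table(table(mtcars$am))
prop.table(table(random_half$am))
prop.table(table(stratified_half$am))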
Useful to know:
Sampling, like all calculations that depend on randomness, uses a random number generator which means that repeated execution of the same code will lead to different results. You can make your calculations reproducible by setting a seed using the set.seed() function. The seed can be any number. The same seed will always produce the same random numbers. In general, it is good practice to set a seed to make sure that the results do not change between runs. You will see that we use the set.seed() function in the examples below.
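For example, resetting the seed reproduces exactly the same draw; the seed value 42 is arbitrary.
set.seed(42)
sample(1:10, 3)   # three random numbers
set.seed(42)
sample(1:10, 3)   # the same three numbers again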
12.1 Sampling in statistical modeling
In statistical modeling, we use sampling to create various subsets of the data. Common scenarios are:
- Split the data randomly into training and holdout set; use the training set to fit the model and the holdout set to evaluate the model. This approach can be taken if your model doesn’t require tuning. Using the holdout set to evaluate the model is a way to get an unbiased estimate of the model performance. It is good practice to repeat this process several times to get a more stable estimate of the model performance. This is called repeated holdout (see Figure 12.3).
- Split the data randomly into training, validation, and holdout set; use the training set to fit various models, select a specific model using the performance on the validation set, and use the holdout set to evaluate the final model. This approach can be taken if you have sufficient data and training the model is costly. See Figure 12.4 for an illustration of this approach.
- The previous scenario relies on a single training/validation split for comparing the performance of different models. A more robust approach is to use cross-validation. In cross-validation, the data is split systematically into \(k\) folds. Each fold is used as a validation set and the model is trained on the remaining \(k-1\) folds. In total, we train \(k\) models. Each of the folds is used as a validation set once. The performance of the model is then averaged over the \(k\) folds. You repeat this \(k\)-fold cross-validation approach for each of the models you want to compare and pick the best model based on the estimated performance. Finally, the holdout set is used to make a decision on deploying the model or not. See Figure 12.5 for an illustration of this approach.
- A similar approach to cross-validation is using the bootstrap. Here the data are split randomly with replacement into a training and a validation set. The training set is used to train a model and the validation set to estimate the performance. Because of sampling with replacement, the training set will contain duplicates and the validation set will have a different size for each bootstrap sample. However, because we repeat this bootstrap splitting several times, each data point will eventually be used in training and in validation. It is up to you how many bootstrap samples you create. Once we have evaluated each bootstrap sample, the performance estimates are combined and used to compare the various models and pick the best model based on the estimated performance. It is best to use the same splits for each of the models. See Figure 12.6 for an illustration of this approach. A brief preview of the rsample functions for these resampling schemes follows this list.
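As a preview of how these resampling schemes look in the rsample package (the functions below are documented at https://rsample.tidymodels.org/; the mtcars data, the seed, and the numbers of repeats are arbitrary choices):
library(rsample)
set.seed(123)
mc_cv(mtcars, prop = 0.75, times = 10)   # repeated holdout (Monte Carlo cross-validation)
vfold_cv(mtcars, v = 5)                  # 5-fold cross-validation
bootstraps(mtcars, times = 25)           # 25 bootstrap resamples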
There are more variations of these approaches. For example, you can use nested cross-validation to tune the hyperparameters of a model. In this case, you use cross-validation to compare different models and then use cross-validation again to tune the hyperparameters of the best model. There are also specialized approaches for time series data. We will not cover these approaches in this class.
In this section, we will see how to implement these approaches using the rsample package, which is part of tidymodels.
Useful to know:
If you look at the literature, you will find that the terminology is not always consistent. For example, some authors use the term validation set to refer to the holdout set. In this class, we will use the term validation set to refer to the set that is used to compare different models and holdout set to refer to the set that is used to evaluate the final model. The rsample package uses its own terminology.
12.2 Creating an initial split of the data into training and holdout set
The first step in building a model is to split the data into a training and a holdout set. The objective of the holdout set is to get an unbiased estimate of the model performance for the selected best model. The holdout set is used only once at the end of the modeling process and not for any intermediate steps. rsample refers to the holdout set as the testing set.
We can use the function rsample::initial_split to create a single split of the data. Here is an example:
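The call below mirrors the code summarized at the end of this chapter; it splits the mtcars dataset using the default settings.
library(rsample)
set.seed(1353)
car_split <- initial_split(mtcars)
car_split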
## <Training/Testing/Total>
## <24/8/32>
The function initial_split by itself doesn’t create different subsets of the data; it only creates a blueprint for how the data should be split. Here, we see that the 32 data points are split into 24 data points for training and 8 for testing. By default, the function splits the data into 75% for training and 25% for testing.
To get the individual subsets, we need to use the functions training and testing.
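Again mirroring the chapter code summary (the dim() calls are only added here to confirm the sizes):
train_data <- training(car_split)
test_data <- testing(car_split)
dim(train_data)   # 24 rows
dim(test_data)    # 8 rows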
The function rsample::initial_split is used to create a single split of the data. It takes several arguments. The most commonly used ones are:
- prop: the proportion of the data that should be used for training. The default is 0.75.
- strata: a variable that is used to stratify the data. The default is NULL. It is good practice to use stratification.
- breaks: this argument is used for creating stratified samples from a continuous variable. It specifies the number of breaks that should be used to create the strata. The default is 4. A sketch using strata and breaks follows this list.
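For example, a stratified split on a continuous variable could look like the following sketch; the choice of mpg as the stratification variable, the 80/20 proportion, and the seed are illustrative assumptions.
set.seed(2024)
car_split_strat <- initial_split(mtcars, prop = 0.8, strata = mpg, breaks = 4)
car_split_strat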
12.3 Creating an initial split of the data into training, validation, and holdout set
If, for some reason, you cannot afford to use one of the iterative approaches (cross-validation or bootstrap), you can use a single split of the data into training, validation, and holdout set. The training set is used to fit the model, the validation set to compare different models, and the holdout set to evaluate the final model. The function rsample::initial_validation_split allows you to create a random split of your dataset.
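The call below mirrors the code summarized at the end of this chapter:
set.seed(9872)
car_split <- initial_validation_split(mtcars)
car_split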
## <Training/Validation/Testing/Total>
## <19/6/7/32>
We see that the 32 data points are split into 19 data points for training, 6 for validation, and 7 for testing/holdout. By default, the function splits the data into 60% for training, 20% for validation, and 20% for testing/holdout.
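If you want different proportions, initial_validation_split accepts a length-two prop argument giving the fractions for training and validation, with the remainder going to the holdout set; the 70/15/15 split below is an arbitrary example.
set.seed(9872)
car_split_custom <- initial_validation_split(mtcars, prop = c(0.70, 0.15))
car_split_custom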
To get the individual subsets, we need to use the functions training, validation, and testing.
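As before, mirroring the chapter code summary:
train_data <- training(car_split)
validation_data <- validation(car_split)
holdout_data <- testing(car_split)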
Further information:
- https://rsample.tidymodels.org/ is the documentation for the rsample package.
Code
The code of this chapter is summarized here.
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
knitr::include_graphics("images/model_workflow_validate.png")
knitr::include_graphics("images/stratified-continuous.png")
knitr::include_graphics("images/validation_repeated_holdout.png")
knitr::include_graphics("images/validation_train_validation_holdout_split.png")
knitr::include_graphics("images/validation_cross_validation.png")
knitr::include_graphics("images/validation_bootstrap.png")
library(tidymodels)
# or
library(rsample)
# single training/holdout split of mtcars (75% training / 25% testing by default)
set.seed(1353)
car_split <- initial_split(mtcars)
car_split
train_data <- training(car_split)
test_data <- testing(car_split)
# single training/validation/holdout split (60% / 20% / 20% by default)
set.seed(9872)
car_split <- initial_validation_split(mtcars)
car_split
train_data <- training(car_split)
validation_data <- validation(car_split)
holdout_data <- testing(car_split)