Chapter 27 Technical details

Once you start to develop models for larger datasets, you will realize that tuning models can take a long time. There are a few strategies to help speed up the process.

  • Work with a smaller dataset while you develop your workflow and get your R code right; then switch to the full dataset. Chapter 12 shows how to create a representative subset of your data. This is a good strategy if you are working with a large dataset and are not yet sure that your code is correct (a minimal sketch follows after this list).

  • Use parallel processing to speed up the tuning process. Resampling approaches like cross-validation benefit greatly from it. Section 27.1 shows how to set it up; we made use of this approach in various chapters of this book. This strategy is particularly useful for larger datasets.

  • Use caching to avoid recomputing expensive results. Section 27.2 shows how to use caching to speed up the tuning process. This is a good strategy once you are confident in your code and results and are working on describing and discussing them in the text.
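
For illustration, here is a minimal sketch of drawing a 10% development sample, assuming your dataset is in a data frame called data (Chapter 12 discusses how to create a representative subset):

Code
library(dplyr)

# draw a random 10% sample for development; switch back to the full
# dataset once the code works
set.seed(1)
data_dev <- slice_sample(data, prop = 0.1)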

Load required packages:

Code
library(tidymodels)
library(tidyverse)

27.1 Parallel processing

Modern computers no longer consist of a single CPU core but have multiple cores. You can use these cores to speed up the tuning process. R has a variety of packages to set up and use parallel processing on your computer; the doParallel package is one of them. At the start of your R script or R Markdown file, you need to set up the parallel processing environment.

Code
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

After loading the doParallel package, we create a cluster of workers. Here, the number of workers is set to the number of physical cores on your computer. You can lower this number if you want to keep some cores available for other tasks or if your calculations run out of memory (RAM). The registerDoParallel function registers the cluster of workers as the parallel processing environment.
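
For example, to keep one physical core free for other work, a variation of the setup above:

Code
library(doParallel)
# use all physical cores except one (but at least one worker)
n_workers <- max(1, parallel::detectCores(logical = FALSE) - 1)
cl <- makePSOCKcluster(n_workers)
registerDoParallel(cl)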

Once you are done with all calculations, you can stop the parallel processing environment by executing the following code:

Code
stopCluster(cl)
registerDoSEQ()

The stopCluster function stops the cluster of workers, and the registerDoSEQ function registers the default sequential processing environment again. Note: we didn’t execute this code here, because we will use parallel processing in what follows.

Code
data <- read_csv("https://gedeck.github.io/DS-6030/datasets/UniversalBank.csv.gz")
data <- data %>% 
    select(-c(ID, `ZIP Code`)) %>%
    rename(
        Personal.Loan = `Personal Loan`,
        Securities.Account = `Securities Account`,
        CD.Account = `CD Account`
    ) %>%
    mutate(
        Personal.Loan = factor(Personal.Loan, labels=c("Yes", "No"), levels=c(1, 0)),
        Education = factor(Education, labels=c("Undergrad", "Graduate", "Advanced"))
    )

formula <- Personal.Loan ~ Income + Family + CCAvg + Education + 
                           Mortgage + Securities.Account + CD.Account + Online + 
                           CreditCard
recipe_rf <- recipe(formula, data=data)
model_rf <- rand_forest(mode="classification", mtry=tune()) %>% 
    set_engine("ranger")
workflow_rf <- workflow() %>%
    add_recipe(recipe_rf) %>%
    add_model(model_rf)
parameters <- extract_parameter_set_dials(workflow_rf)
parameters <- parameters %>% 
    update(
        mtry = mtry(c(2, 6))
    )
Code
set.seed(1)
resamples <- vfold_cv(data, v = 10)

timings <- tibble()
for (number_cores in 1:20) {
    cl <- makePSOCKcluster(number_cores)
    registerDoParallel(cl)

    # train the random forest model five times using cross-validation and
    # measure the execution time
    start.time <- Sys.time()
    for (number_repeats in 1:5) {
        tune_rf <- tune_grid(
            workflow_rf,
            resamples = resamples,
            grid = grid_regular(parameters, levels=5)
        )
    }
    end.time <- Sys.time()
    # elapsed time in seconds, independent of the total duration
    time.taken <- round(as.numeric(difftime(end.time, start.time, units="secs")), 2)
    timings <- bind_rows(timings, tibble(
        number_cores = number_cores,
        time_taken = time.taken
    ))
    stopCluster(cl)
    registerDoSEQ()
}
timings$average_time_taken = timings$time_taken / 5
timings$ideal = timings$average_time_taken[1] / timings$number_cores

We train \(5 \times 5 \times 10 = 250\) models (5 repeats, 5 values of mtry, 10-fold cross-validation). Figure 27.1 shows the reduction in tuning time when parallel processing is used.

Code
ggplot(timings, aes(x=number_cores, y=average_time_taken)) + 
    geom_line() + 
    geom_line(aes(y=ideal), color='grey') + 
    geom_point() + 
    geom_vline(xintercept=parallel::detectCores(logical = FALSE), color="red") +
    xlab("Number of cores") +
    ylab("Average time required for tuning (seconds)")

Figure 27.1: Average time required for tuning a random forest model using different numbers of cores. The red line indicates the number of physical cores on the computer. The grey line shows the expected time if parallel execution were perfect.

Without parallelization, tuning the random forest model takes about 15.4 seconds. Distributing the calculation over two cores reduces the time to 8.2 seconds. If parallelization were perfect, the time would drop to 7.7 seconds; the difference is the overhead involved in parallelization, and we can already see its effect here. Nevertheless, increasing the number of cores further brings the calculation time down to about 2 seconds.
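
We can quantify this with a quick check of speedup and parallel efficiency for two cores:

Code
15.4 / 8.2        # speedup: about 1.9 (ideal: 2)
(15.4 / 8.2) / 2  # parallel efficiency: about 94%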

While not clearly visible in the graph, the tuning time goes through a minimum at around 5 to 10 cores; after that, the average time required for tuning increases again. The computer used for the graph was an M3 MacBook Pro with 14 cores. The flattening of the curve from 5 to 9 cores and again after 10 cores reflects the use of 10-fold cross-validation: with fewer than 10 workers, some workers have to process more than one fold, so an additional core only helps when it reduces the maximum number of folds per worker.
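
A quick back-of-the-envelope check of this effect, under the simplifying assumption that the folds dominate the run time and are distributed evenly:

Code
# with 10 folds, a pool of n workers needs ceiling(10 / n) "waves" of work;
# the run time is roughly proportional to the number of waves
tibble(cores = 1:14, waves = ceiling(10 / cores))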

On some computers, you might see hyperthreading. This is a technique that makes the computer appear to have more (logical) cores than it physically has. It can be useful for tasks that spend much of their time waiting on other resources. For pure computation, however, hyperthreaded cores are not as efficient as physical cores.
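
You can compare the number of logical and physical cores on your computer as follows:

Code
parallel::detectCores(logical = TRUE)   # logical cores, including hyperthreading
parallel::detectCores(logical = FALSE)  # physical cores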

Useful to know:

When you use parallelization and get error messages like the following, your computer is likely running out of memory (RAM).

Error in unserialize(socklist[[n]]) : error reading from connection
Error in serialize(data, node$con): error writing to connection
Error in `summary.connection()`: ! invalid connection

In this case, reduce the number of workers in the cluster, e.g., with makePSOCKcluster(4) to use only 4 cores.

Further information: The tidymodels package will already use parallelization if a cluster is set up. However, if you want to parallelize your own code, you can use the foreach package. It is fairly easy to use. See the following links for examples:

  • https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf gives a short overview of how to use the doParallel package.

  • https://www.blasbenito.com/post/02_parallelizing_loops_with_r/ is a more in-depth introduction to parallelization with the foreach package.
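
As an illustration, here is a minimal sketch of a parallel loop with foreach; the .combine argument collects the individual results into a vector:

Code
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# %dopar% distributes the iterations across the registered workers
result <- foreach(i = 1:10, .combine = c) %dopar% {
    sqrt(i)
}
result

stopCluster(cl)
registerDoSEQ()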

27.2 Caching

Running the code in the previous section takes more than 76 seconds. Executing it every time this document needs to be recreated would make working with the document insufferable. We can use caching to speed up the process. Caching is a technique that saves the results of a calculation and reuses them later: the next time the code is executed, the results are loaded from the cache instead of being recalculated.
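
The idea can be sketched outside of R Markdown as well; this is a minimal sketch, and the file name tuning_cache.rds is a hypothetical choice:

Code
cache_file <- "tuning_cache.rds"
if (file.exists(cache_file)) {
    # reuse the previously saved result
    tune_rf <- readRDS(cache_file)
} else {
    # expensive calculation; save the result for the next run
    tune_rf <- tune_grid(workflow_rf, resamples = resamples,
                         grid = grid_regular(parameters, levels = 5))
    saveRDS(tune_rf, cache_file)
}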

Using caching with R Markdown is straightforward. Above, we used the cache option in the code chunk header to enable caching. Here is the markup for the code chunk that loads and prepares the data:

```{r cache-prepare-model}
#| message: FALSE
#| warning: FALSE
#| cache: TRUE

code to load and prepare the model

```

Note two points here. Firstly, the code chunk has a name. Secondly, the code chunk header contains the cache: TRUE option. The name identifies the cache. If you change the code inside the code chunk, the cache is invalidated and the code is executed again.

We also cache the code chunk that tunes the model.

```{r cache-time-tuning}
#| message: FALSE
#| warning: FALSE
#| cache: TRUE
#| dependson: cache-prepare-model

code to tune the model

```

Like before, the cache: TRUE option caches the results of this code chunk too. Now consider the case where you have knitted the document, both code chunks are cached, and you then make a change to the first code chunk that changes the data. When you knit the document again, the data are reloaded and preprocessed. The code chunk that tunes the model, however, was not changed and would in principle not be executed again; this would lead to invalid results. To avoid this, we use the dependson option in the header of the tuning chunk. It tells R Markdown that the chunk that tunes the model depends on the chunk that loads and prepares the model. Whenever the latter is executed, the tuning chunk is executed again as well.

If you have a code chunk that depends on multiple previous code chunks, you can give a list of chunk names in the dependson option. In this example, test3 depends on test1 and test2.

```{r test1}
#| cache: TRUE
x = 10
x
```
    
```{r test2}
#| cache: TRUE
#| dependson: test1
y = 2
y
```

```{r test3}
#| cache: TRUE
#| dependson: ["test1", "test2"]
x + y
```

You can make a code chunk depend on all previous code chunks by using the dependson=knitr::all_labels() option.
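
In the #| header syntax, an option value that is R code needs the !expr tag. A sketch (the chunk name depend-on-all and its body are placeholders):

```{r depend-on-all}
#| cache: TRUE
#| dependson: !expr knitr::all_labels()

code that depends on everything above

```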

Useful to know:

Another option is to add the following code at the start of your document.

knitr::opts_chunk$set(cache=TRUE, autodep=TRUE)

This will cache all code chunks in the document and automatically determine the dependencies between code chunks based on the global variables they create and use.

If you use this approach together with parallelization, explicitly mark the code chunk that starts the cluster as not cached:

#| cache: FALSE
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)
