Introduction

In the DS-6030 Statistical Learning course, we will use

  • the tidyverse packages for data loading and processing
  • the tidymodels packages for model building and validation

Compared to the classical base-R approach used in An Introduction to Statistical Learning (James et al. 2021), these packages share consistent interfaces, which makes code for data handling and modeling easier to write and to read.

Tidyverse

The tidyverse is a collection of packages that share a common design philosophy and are designed to work together. Hadley Wickham outlined the tidy data principles that underlie the tidyverse in his 2014 paper "Tidy Data", published in the Journal of Statistical Software 59(10), 1–23.
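
As a rough illustration of what "tidy" means in that paper (each variable is a column, each observation is a row), the sketch below reshapes a small table with tidyr. The scores_wide table and its values are made up for this example and are not course data.

```r
library(tidyr)   # part of the tidyverse
library(tibble)

# Hypothetical "wide" table: one row per student, one column per exam
scores_wide <- tibble(
  student = c("A", "B"),
  exam1   = c(90, 75),
  exam2   = c(85, 80)
)

# Tidy ("long") form: each row is one student-exam observation
scores_tidy <- pivot_longer(
  scores_wide,
  cols      = c(exam1, exam2),
  names_to  = "exam",
  values_to = "score"
)

scores_tidy
```

In the tidy form, exam is a variable in its own right, so it can be used directly for grouping, filtering, or plotting.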

To load the tidyverse, use the following command:

library(tidyverse)
## ── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You will see that this loads a number of packages. The most important ones are:

  • ggplot2 for plotting
  • dplyr for data manipulation
  • readr for data import
  • tibble for improved data frames
  • tidyr for getting data into tidy form
  • purrr for functional programming
  • stringr for string manipulation
  • forcats for categorical/factor data

Tidymodels

The tidymodels framework is developed by Max Kuhn, who works at Posit (formerly RStudio). It was first released in 2018 and is still under active development. Like the tidyverse, it is an ecosystem of packages that share a common design philosophy and are designed to work together. The packages include

  • parsnip for model specification
  • recipes for data preprocessing
  • rsample for resampling
  • yardstick for model evaluation
  • tune for hyperparameter tuning
  • workflows for modeling workflows
  • tidyposterior for Bayesian comparison of models based on resampling results

The tidymodels packages are designed to work with the tidyverse and follow tidy data principles. They are also modular and extensible.
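
The following sketch shows how several of these packages might fit together in one small modeling workflow. It assumes the built-in mtcars data set, a linear regression of mpg on wt and hp, and an arbitrary seed; none of these choices come from the course material, and tuning with tune is left out to keep the example short.

```r
library(tidymodels)

set.seed(6030)  # arbitrary seed so the split is reproducible

# rsample: hold out 25% of the rows for testing
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare outcome, predictors, and preprocessing steps
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# parsnip: specify the model type and the fitting engine
lm_spec <- linear_reg() |>
  set_engine("lm")

# workflows: bundle preprocessing and model, then fit on the training set
car_fit <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# yardstick: evaluate predictions on the held-out rows
car_pred <- augment(car_fit, new_data = car_test)
metrics(car_pred, truth = mpg, estimate = .pred)
```

Each step is a separate object, so the model specification or the recipe can be swapped out without touching the rest of the workflow.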

Getting Help

RStudio

  • Install R and RStudio
  • Make use of Projects in RStudio

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. https://r4ds.hadley.nz/.