DS-6030 - Statistical Learning
R for exploratory data analysis and statistical modeling
2025-06-23
Introduction
In the DS-6030 Statistical Learning course, we will use
- the tidyverse packages for data loading and processing
- the tidymodels packages for model building and validation
Compared to the classical base-R packages covered in An Introduction to Statistical Learning (James et al. 2021), these packages offer many advantages that will make working with data easier and more streamlined.
Tidyverse
The tidyverse is a collection of packages that share a common design philosophy and are designed to work together. Hadley Wickham outlined the principles of the tidyverse in 2014 in the Tidy Data paper published in the Journal of Statistical Software 59(10), 1–23.
To load the tidyverse, use the following command:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You will see that this loads a number of packages. The most important ones are:
- ggplot2 for plotting
- dplyr for data manipulation
- readr for data import
- tibble for improved data frames
- tidyr for getting data into tidy form
- purrr for functional programming
- stringr for string manipulation
- forcats for categorical/factor data
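As a rough illustration of how these packages fit together, the following sketch uses the mtcars data set that ships with R. The object names (cars, mpg_by_gear, p) and the choice of data are assumptions made for this example and are not part of the course material.

```r
library(tidyverse)

# mtcars ships with base R; convert it to a tibble and keep the car names
cars <- mtcars |>
  rownames_to_column("model") |>
  as_tibble()

# dplyr: keep the 4-cylinder cars, then summarise fuel economy by gear count.
# Writing dplyr::filter() explicitly sidesteps the stats::filter() conflict
# that is reported when the tidyverse is loaded.
mpg_by_gear <- cars |>
  dplyr::filter(cyl == 4) |>
  group_by(gear) |>
  summarise(n = n(), mean_mpg = mean(mpg), .groups = "drop")

mpg_by_gear

# ggplot2: scatter plot of weight against fuel economy, coloured by cylinders
p <- ggplot(cars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
p
```

The native pipe |> requires R 4.1 or later; with older versions, the magrittr pipe %>% that the tidyverse provides can be used instead.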
Tidymodels
The tidymodels framework is developed by Max Kuhn, who now works at Posit (formerly RStudio). It was first released in 2018 and is still under active development. Like the tidyverse, it is an ecosystem of packages that share a common design philosophy and are designed to work together. The packages include:
- parsnip for model specification
- recipes for data preprocessing
- rsample for resampling
- yardstick for model evaluation
- tune for hyperparameter tuning
- workflows for modeling workflows
- tidyposterior for Bayesian comparison of models based on resampling results
The tidymodels packages follow tidyverse and tidy data principles, and they are designed to be modular and extensible.
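The modular design can be seen in a minimal sketch in which each step is handled by a different package. The data set (mtcars), the predictors, the seed, and all object names below are assumptions chosen for illustration; this is one possible arrangement, not a prescribed course workflow.

```r
library(tidymodels)

set.seed(6030)  # arbitrary seed so the random split is reproducible

# rsample: hold out 25% of the rows for testing
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare outcome and predictors, then centre and scale the predictors
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# parsnip: a linear regression fitted with the "lm" engine
lm_spec <- linear_reg() |>
  set_engine("lm")

# workflows: bundle preprocessing and model, then fit on the training set
car_fit <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# yardstick: default regression metrics (RMSE, R-squared, MAE) on held-out rows
car_metrics <- augment(car_fit, new_data = car_test) |>
  metrics(truth = mpg, estimate = .pred)
car_metrics
```

Because the model specification is separate from the preprocessing, swapping lm_spec for another parsnip model leaves the rest of the code unchanged.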
Getting Help
- A good introduction to basic data analysis in R is the free book R for Data Science (2e) (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
- Web search, especially stackoverflow.com and stats.stackexchange.com
- Troubleshooting/Debugging:
  - Check one line of code at a time.
  - Google your error message.
  - Use scripts.
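Checking one line at a time might look like the following sketch, which breaks a pipeline into named intermediate steps. The data set, the column name kpl, the approximate conversion factor, and the step names are assumptions made for this example.

```r
library(tidyverse)

# A pipeline written in one go; if it fails, it is hard to see which step broke
result <- mtcars |>
  dplyr::filter(cyl == 6) |>
  mutate(kpl = mpg * 0.425) |>   # approximate miles per gallon to km per litre
  summarise(mean_kpl = mean(kpl))

# The same pipeline, run one step at a time with a check after each step
step1 <- mtcars |> dplyr::filter(cyl == 6)
nrow(step1)       # did the filter keep the rows you expected?

step2 <- step1 |> mutate(kpl = mpg * 0.425)
glimpse(step2)    # is the new column there, with a sensible type?

step3 <- step2 |> summarise(mean_kpl = mean(kpl))
step3
```

Saving code like this in a script file, rather than typing it into the console, makes it easy to rerun individual lines after each fix.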