Introduction
In the DS-6030 Statistical Learning course, we will use
- the tidyverse packages for data loading and processing
- the tidymodels packages for model building and validation
Compared to the classical base-R packages covered in An Introduction to Statistical Learning (James et al. 2021), these packages offer many advantages that will make working with data easier and more streamlined.
Tidyverse
The tidyverse is a collection of packages that share a common design philosophy and are designed to work together. Hadley Wickham laid out the tidy data principles behind the tidyverse in the 2014 paper "Tidy Data", published in the Journal of Statistical Software 59(10), 1–23.
To load the tidyverse, run library(tidyverse), which prints the following startup message:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
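The conflict lines in the message above mean that, once the tidyverse is attached, an unqualified call to filter() or lag() resolves to the dplyr version rather than the stats version. The following is a minimal sketch of how the two filter() functions differ and how the `package::function` prefix selects one explicitly. It assumes dplyr is installed and uses the built-in mtcars data purely for illustration.

```r
# Minimal sketch; assumes the dplyr package is installed.
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series;
# here it computes a centred 3-point moving average.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition.
small_cars <- dplyr::filter(mtcars, cyl == 4)

# After library(dplyr), an unqualified filter() is dplyr::filter().
same_result <- identical(filter(mtcars, cyl == 4), small_cars)
```

Writing the prefix is optional for dplyr once the tidyverse is loaded, but it is required whenever the masked stats version is the one you want.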
As the startup message shows, loading the tidyverse attaches several packages. The most important ones are:
- ggplot2 for plotting
- dplyr for data manipulation
- readr for data import
- tibble for improved data frames
- tidyr for getting data into tidy form
- purrr for functional programming
- stringr for string manipulation
- forcats for categorical/factor data
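To show how several of these packages fit together, here is a minimal sketch of a small pipeline. The `scores` data set and its column names are made up for illustration, and the example assumes the tidyverse is installed and R is version 4.1 or later (for the native `|>` pipe).

```r
# Minimal sketch; assumes the tidyverse is installed.
library(tidyverse)

# tibble: build a small, made-up data frame in "wide" form (one column per year).
scores <- tibble(
  student = c("A", "B", "C"),
  `2023` = c(70, 85, 90),
  `2024` = c(75, 80, 95)
)

# tidyr: reshape to tidy form, one row per student and year.
scores_long <- scores |>
  pivot_longer(cols = c(`2023`, `2024`), names_to = "year", values_to = "score")

# dplyr: compute the mean score per year.
mean_by_year <- scores_long |>
  group_by(year) |>
  summarise(mean_score = mean(score))

# ggplot2: draw one line per student across the two years.
p <- ggplot(scores_long, aes(x = year, y = score, group = student, colour = student)) +
  geom_line() +
  geom_point()
```

Printing `p` renders the plot. Each step takes a data frame and returns a data frame, which is what lets the packages be chained with the pipe.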
Tidymodels
The tidymodels package was first released in 2018. It is an open-source project maintained by the company Posit and is still under active development. Like the tidyverse, it is an ecosystem of packages that share a common design philosophy and are designed to work together. The packages include:
- parsnip for model specification
- recipes for data preprocessing
- rsample for resampling
- yardstick for model evaluation
- tune for hyperparameter tuning
- workflows for modeling workflows
- tidyposterior for Bayesian comparison of models based on resampling results
The tidymodels packages are modular and extensible. They are designed to work with the tidyverse and to follow tidy data principles.
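The following minimal sketch illustrates how some of these packages combine in a single modeling workflow. It assumes tidymodels is installed and R is version 4.1 or later. The built-in mtcars data, the choice of predictors, and the object names are illustrative assumptions, not part of the course material. Hyperparameter tuning with tune is left out because a plain linear regression has nothing to tune.

```r
# Minimal sketch; assumes the tidymodels package is installed.
library(tidymodels)

set.seed(123)  # make the random split reproducible

# rsample: hold out 25% of the rows for testing.
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test <- testing(car_split)

# recipes: declare outcome and predictors, and add a preprocessing step.
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# parsnip: specify a linear regression fitted with the "lm" engine.
lm_spec <- linear_reg() |>
  set_engine("lm")

# workflows: bundle preprocessing and model, then fit on the training data.
car_workflow <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(lm_spec)
car_fit <- fit(car_workflow, data = car_train)

# yardstick: compute default regression metrics on the held-out data.
car_metrics <- augment(car_fit, new_data = car_test) |>
  metrics(truth = mpg, estimate = .pred)
```

Because the recipe and the model specification are separate objects, either one can be swapped out (for example, a different engine or an extra preprocessing step) without rewriting the rest of the workflow.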
RStudio
- Install R and RStudio
- Make use of Projects in RStudio
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. New York, NY: Springer US. https://doi.org/10.1007/978-1-0716-1418-1.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. “R for Data Science (2e).” https://r4ds.hadley.nz/.