DS-6030 - Statistical Learning
R for exploratory data analysis and statistical modeling
2024-08-07
Introduction
In the DS-6030 Statistical Learning course, we will use
- the tidyverse packages for data loading and processing
- the tidymodels packages for model building and validation
Compared to the classical base-R packages covered in An Introduction to Statistical Learning (James et al. 2021), these packages offer many advantages that will make working with data easier and more streamlined.
Tidyverse
The tidyverse is a collection of packages that share a common design philosophy and are designed to work together. To load the tidyverse, run library(tidyverse), which prints the following startup message:
## ── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You will see that this loads a number of packages. The most important ones are:
- ggplot2 for plotting
- dplyr for data manipulation
- readr for data import
- tibble for improved data frames
- tidyr for getting data into tidy form
- purrr for functional programming
- stringr for string manipulation
- forcats for categorical/factor data
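As a rough illustration of how these packages fit together, the sketch below uses the built-in mtcars data set (an assumption chosen for convenience, not a course data set) to build a tibble, summarize it with dplyr, and plot it with ggplot2. It assumes R 4.1 or later for the native pipe `|>`.

```r
library(tidyverse)

# mtcars ships with base R; keep the car names as a column and convert to a tibble
cars <- mtcars |>
  rownames_to_column("model") |>
  as_tibble()

# dplyr: average fuel economy and group size by number of cylinders
mpg_by_cyl <- cars |>
  mutate(cyl = factor(cyl)) |>   # treat cylinder count as categorical
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())

# ggplot2: fuel economy against weight, colored by cylinder count
ggplot(cars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")
```

Printing mpg_by_cyl shows one row per cylinder count, and the plot appears in the graphics pane.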
Tidymodels
The tidymodels framework is developed by Max Kuhn, who now works at Posit (formerly RStudio). It was first released in 2018 and is still under active development. Like the tidyverse, it is an ecosystem of packages that share a common design philosophy and are designed to work together. The packages include:
- parsnip for model specification
- recipes for data preprocessing
- rsample for resampling
- yardstick for model evaluation
- tune for hyperparameter tuning
- workflows for modeling workflows
- tidyposterior for Bayesian analysis of resampling results when comparing models
The tidymodels packages are designed to work with the tidyverse and to follow tidy data principles. They are also modular and extensible.
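To show how these packages divide the work, here is a minimal sketch of one possible modeling workflow. The data set (mtcars), the predictors (wt and hp), the 75/25 split, and the seed are all assumptions made for illustration, and the sketch assumes R 4.1 or later for the native pipe `|>`.

```r
library(tidymodels)

set.seed(1)  # make the random split reproducible

# rsample: hold out 25% of the rows for testing
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare outcome and predictors, then center and scale the predictors
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# parsnip: specify a linear regression fitted with the lm engine
lm_spec <- linear_reg() |>
  set_engine("lm")

# workflows: bundle preprocessing and model, then fit on the training data
car_fit <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# yardstick: default regression metrics (RMSE, R-squared, MAE) on the held-out rows
car_metrics <- augment(car_fit, new_data = car_test) |>
  metrics(truth = mpg, estimate = .pred)
```

Because each step is a separate object, the model specification or the recipe can be swapped out without touching the rest of the workflow.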
Getting Help
- A good introduction to basic data analysis with R is the free book R for Data Science (2e) (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
- Web search, especially stackoverflow.com and stats.stackexchange.com
- Troubleshooting/Debugging:
  - Check one line of code at a time.
  - Google your error message.
- Use scripts
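One way to check code one line at a time is to split a pipeline into named intermediate steps and inspect each result. The sketch below is only an illustration: the mtcars data, the step names, and the approximate conversion factor from miles per gallon to kilometers per liter are assumptions. It assumes R 4.1 or later for the native pipe `|>`.

```r
library(tidyverse)

# A pipeline run as a single statement: if it fails, it is unclear which step broke
result <- mtcars |>
  filter(cyl == 4) |>
  mutate(kpl = mpg * 0.425) |>   # approximate mpg to km per liter conversion
  summarize(mean_kpl = mean(kpl))

# The same steps run one at a time, inspecting each intermediate result
step1 <- mtcars |> filter(cyl == 4)
glimpse(step1)                   # did the filter keep the expected rows?

step2 <- step1 |> mutate(kpl = mpg * 0.425)
summary(step2$kpl)               # does the new column look plausible?

step3 <- step2 |> summarize(mean_kpl = mean(kpl))
step3
```

Keeping such code in a script rather than typing it into the console makes it easy to rerun each step after a fix.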