Introduction
This book is an introduction to machine learning with R and the tidymodels ecosystem of packages. It is designed for readers who have some experience with R and want to learn how to use it for machine learning. The book is organized into parts that cover exploratory data analysis, training models, regression and classification models, model validation and tuning, unsupervised learning, and deeper dives into specific model types. Each chapter combines explanations of the underlying concepts with worked examples, and ends with a “Code” section that summarizes all the R code used in the chapter so you can reproduce the results yourself. A final “Examples” part pulls several of these ideas together into complete modeling workflows.
Tidyverse
The tidyverse is a collection of packages that share a common design philosophy and are designed to work together. Hadley Wickham set out the tidy data principles that underpin it in the 2014 paper “Tidy Data”, published in the Journal of Statistical Software 59(10), 1–23.
To load the tidyverse, run library(tidyverse), which prints the following startup message:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You will see that this loads a number of packages. The most important ones are:
- ggplot2 for plotting
- dplyr for data manipulation
- readr for data import
- tibble for improved data frames
- tidyr for getting data into tidy form
- purrr for functional programming
- stringr for string manipulation
- forcats for categorical/factor data
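As a minimal sketch of how a few of these packages fit together, the following builds a small tibble, summarizes it with dplyr, and prepares a ggplot2 chart. It assumes the tidyverse is installed and R is version 4.1 or later (for the native pipe). The measurements data and its column names are hypothetical, invented for this illustration.

```r
library(tidyverse)

# Hypothetical toy data, invented for this illustration
measurements <- tibble(
  group = c("a", "a", "b", "b", "c"),
  value = c(1.5, 2.5, 4.0, 6.0, 10.0)
)

# dplyr: drop the large value, then summarise by group.
# filter() is written as dplyr::filter() to make explicit that the
# dplyr version is meant and not the masked stats::filter().
group_means <- measurements |>
  dplyr::filter(value < 10) |>
  group_by(group) |>
  summarise(mean_value = mean(value), n = n())

print(group_means)

# ggplot2: build a bar chart of the group means
# (print(p) would draw it)
p <- ggplot(group_means, aes(x = group, y = mean_value)) +
  geom_col()
```

Writing the namespace prefix is optional once the tidyverse is attached, but it can make code easier to read where the conflicts listed in the startup message matter.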
Tidymodels
The tidymodels package was first released in 2018 and remains under active development; it is maintained by the company Posit as an open-source project. Tidymodels is an ecosystem of packages that share a common design philosophy and are designed to work together. The packages include:
- parsnip for model specification
- recipes for data preprocessing
- rsample for resampling
- yardstick for model evaluation
- tune for hyperparameter tuning
- workflows for modeling workflows
- tidyposterior for Bayesian comparison of models based on resampling results
- probably for postprocessing of predictions (e.g. thresholding for binary classification models)
- tailor for postprocessing of predictions as part of workflows
The tidymodels packages follow tidyverse and tidy data principles and are designed to be modular and extensible.
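To illustrate the modular design, here is a minimal sketch that combines rsample, recipes, parsnip, workflows, and yardstick in one small regression workflow. It assumes the tidymodels package is installed. The choice of R's built-in mtcars data, the predictors wt and hp, and the object names are assumptions made for this illustration, not an example taken from a later chapter.

```r
library(tidymodels)

set.seed(123)  # make the random split reproducible

# rsample: split the built-in mtcars data into training and test sets
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# recipes: declare the preprocessing (centre and scale the predictors)
car_recipe <- recipe(mpg ~ wt + hp, data = car_train) |>
  step_normalize(all_numeric_predictors())

# parsnip: specify the model type and the fitting engine
lm_spec <- linear_reg() |>
  set_engine("lm")

# workflows: bundle preprocessing and model, then fit on the training set
car_fit <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# yardstick: compute default regression metrics on the held-out test set
car_metrics <- augment(car_fit, new_data = car_test) |>
  metrics(truth = mpg, estimate = .pred)

print(car_metrics)
```

Because each step is a separate object, one piece can be swapped without touching the others, for example replacing linear_reg() with a different parsnip model specification.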
Getting Help
- A good introduction to basic data analysis with R is the free book R for Data Science (2e) (Wickham et al. 2023).
- Web search, especially stackoverflow.com and stats.stackexchange.com
- Troubleshooting/Debugging.
- Check one line of code at a time.
- Search the web for your error message.
- Use scripts.
RStudio
- Install R and RStudio
- Make use of Projects in RStudio
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science (2e). https://r4ds.hadley.nz/.