Module 8

Module 8 introduced ensemble models. In this homework, we will build various regression models to predict the fish toxicity of chemical compounds. You can download the R Markdown file and use it to answer the following questions. Unless stated otherwise, use Tidyverse and Tidymodels for the assignments.

Without caching and parallel processing, knitting the assignment can take a very long time. For example, the calculations for problem 1 take more than 16 minutes without parallel execution. Caching reduces it to 5 minutes. Once all results are cached, knitting the document takes only a few seconds. You can find more information about caching here and about parallel processing here.
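As a rough sketch, a setup chunk near the top of the R Markdown file could switch on caching and register a parallel backend. The use of doParallel and the worker count of 2 are assumptions for illustration; recent tune releases also support the future package, so check which backend your installed version expects.

```r
library(doParallel)  # also attaches the foreach and parallel packages

# Cache the results of every chunk; autodep = TRUE lets knitr work out
# dependencies between cached chunks.
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE)

# Start a small cluster (2 workers is an arbitrary choice) and register it
# as the foreach backend.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
n_workers <- foreach::getDoParWorkers()

# ... resampling and tuning code would run here ...

# Shut the workers down and return to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Stopping the cluster at the end of the document avoids leaving idle R processes behind after knitting.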

1. Predicting fish toxicity of chemicals (regression)

The dataset qsar_fish_toxicity.csv contains information about the toxicity of 908 chemicals to fish. The data was downloaded from the UCI Machine Learning Repository. The dataset contains 6 features and the toxicity of the chemicals. The features are a variety of molecular structure descriptors. The toxicity is measured as the negative logarithm of the concentration that kills 50% of the fish after 96 hours of exposure.

CIC0: Information indices
SM1_Dz(Z): 2D matrix-based descriptors
GATS1i: 2D autocorrelations
NdsCH: Atom-type counts
NdssC: Atom-type counts
MLOGP: Molecular properties
LC50: Toxicity towards fish (negative log value; the response variable)

Data loading and preparation

(1.1) Load the data from https://gedeck.github.io/DS-6030/datasets/homework/qsar_fish_toxicity.csv. Hint: the dataset has no column headers and its fields are separated by semicolons; read the documentation for read_delim. (1 point - coding)
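A minimal sketch of how read_delim handles a semicolon-separated file without a header row. The helper name read_toxicity is hypothetical, the column name SM1_Dz is a shortened form of SM1_Dz(Z) chosen here for convenience, and the two demo rows are made-up values that only mimic the file layout.

```r
library(readr)

# Column names taken from the descriptor list above; the file itself has none.
fish_cols <- c("CIC0", "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50")

# Hypothetical helper: read a semicolon-delimited file and attach the names.
read_toxicity <- function(file) {
  read_delim(file, delim = ";", col_names = fish_cols, show_col_types = FALSE)
}

# Two made-up rows in the same layout, passed as literal data via I(),
# only to show that the call parses seven numeric columns.
demo <- read_toxicity(I("3.2;0.8;1.4;0;1;1.5;3.7\n2.1;0.6;1.1;1;0;1.8;3.1\n"))
```

For the assignment, the same helper would be called with the dataset URL given in (1.1) instead of the literal demo rows.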

(1.2) Split the data into training (80%) and test (20%) sets using stratified sampling on LC50. Prepare the folds for 10-fold cross validation of all models. (1 point - coding)
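The split and the folds can be sketched with rsample as follows. The data frame toy is simulated so the example is self-contained; the seed, the use of strata for the folds, and the object names are assumptions.

```r
library(rsample)

set.seed(123)

# Simulated stand-in for the toxicity data (not the real dataset).
toy <- data.frame(x = rnorm(200), LC50 = rnorm(200))

# 80/20 split; a numeric strata variable is binned into quantiles internally.
toy_split <- initial_split(toy, prop = 0.8, strata = LC50)
train <- training(toy_split)
test <- testing(toy_split)

# 10 folds built from the training set only, reused for every model.
folds <- vfold_cv(train, v = 10, strata = LC50)
```

Creating the folds once and passing the same object to every model keeps the cross-validation results comparable across models.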

Model training

(1.3) Build the following models using the training set and evaluate them using 10-fold cross validation. For tuned models, use the autoplot function to inspect the tuning results and carry out cross-validation with the optimal parameters. (6 points - coding)

Linear regression
Random forest (tune min_n and mtry)
Boosting model (tune min_n and mtry)
k-nearest neighbors (tune neighbors)
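The tune, inspect, finalize, and cross-validate pattern could look roughly like this for the random forest. The data is simulated with three predictors, the ranger engine is assumed to be installed, and the grid ranges, 5 folds, and object names are illustrative choices rather than recommended settings.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in data with three predictors (not the real dataset).
toy <- tibble(x1 = rnorm(150), x2 = rnorm(150), x3 = rnorm(150)) %>%
  mutate(LC50 = x1 - 0.5 * x2 + rnorm(150, sd = 0.3))
folds <- vfold_cv(toy, v = 5)

# Model specification with two tunable parameters.
rf_spec <- rand_forest(mode = "regression", mtry = tune(), min_n = tune()) %>%
  set_engine("ranger")
rf_wf <- workflow() %>%
  add_formula(LC50 ~ .) %>%
  add_model(rf_spec)

# Small regular grid; mtry cannot exceed the number of predictors (3 here).
rf_grid <- grid_regular(mtry(c(1, 3)), min_n(c(2, 20)), levels = 3)
reg_metrics <- metric_set(rmse, rsq)

tuned <- tune_grid(rf_wf, resamples = folds, grid = rf_grid,
                   metrics = reg_metrics)
tuning_plot <- autoplot(tuned)  # print this object to inspect the results

# Plug the best parameters into the workflow and rerun cross-validation.
best <- select_best(tuned, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best)
cv_res <- fit_resamples(final_wf, resamples = folds, metrics = reg_metrics,
                        control = control_resamples(save_pred = TRUE))
cv_metrics <- collect_metrics(cv_res)
```

Setting save_pred = TRUE keeps the out-of-fold predictions, which can later be retrieved with collect_predictions.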

(1.4) Report the cross-validation metrics (RMSE and \(r^2\)). What do you observe? Which model would you pick based on these results? (2 points - discussion)

(1.5) Following cross-validation and tuning, fit final models with the optimal parameters to the training set. (1 point - coding)

Model evaluation

(1.6) Evaluate the models on the test set and report their performance metrics RMSE and MAE. (1 point - coding/discussion)
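A minimal sketch of fitting on the training set and scoring on the test set, shown for a linear regression on simulated data. The object names and the simulated relationship are assumptions; the same two steps would apply to each finalized workflow.

```r
library(tidymodels)

set.seed(42)

# Simulated stand-in data (not the real dataset).
toy <- tibble(x = rnorm(200), LC50 = 2 * x + rnorm(200, sd = 0.5))
toy_split <- initial_split(toy, prop = 0.8, strata = LC50)

# Fit the workflow on the training portion only.
lm_fit <- workflow() %>%
  add_formula(LC50 ~ x) %>%
  add_model(linear_reg()) %>%
  fit(data = training(toy_split))

# metric_set() returns a function that computes both metrics at once.
reg_metrics <- metric_set(rmse, mae)
test_metrics <- augment(lm_fit, new_data = testing(toy_split)) %>%
  reg_metrics(truth = LC50, estimate = .pred)
```

Calling the same metric function on augment(lm_fit, new_data = training(toy_split)) gives the training-set counterpart for a side-by-side comparison.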

(1.7) Create a visualization that compares the RMSE values for the training and test sets of the four models. Do you see a difference between the models? Is there an indication of overfitting for any of the models? (2 points - coding/discussion)

(1.8) Create residual plots from the cross-validated predictions for the four models (add a geom_smooth line to each). Combine the four plots in a single figure using the patchwork package. What do you observe? Are there differences in the residuals of the four models? (1 point - coding/discussion)
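One way to sketch the residual plots, using two simple models on simulated data so the example stays small. The helper names cv_predictions and residual_plot are hypothetical, and the decision tree (rpart engine) merely stands in for the other models in the assignment.

```r
library(tidymodels)
library(patchwork)

set.seed(7)

# Simulated stand-in data with a mild nonlinearity (not the real dataset).
toy <- tibble(x = runif(200, -2, 2),
              LC50 = x + 0.5 * x^2 + rnorm(200, sd = 0.3))
folds <- vfold_cv(toy, v = 10)
keep_pred <- control_resamples(save_pred = TRUE)

# Hypothetical helper: out-of-fold predictions for a model specification.
cv_predictions <- function(spec) {
  workflow() %>%
    add_formula(LC50 ~ x) %>%
    add_model(spec) %>%
    fit_resamples(resamples = folds, control = keep_pred) %>%
    collect_predictions()
}

# Hypothetical helper: residual (observed minus predicted) versus prediction.
residual_plot <- function(preds, title) {
  ggplot(preds, aes(x = .pred, y = LC50 - .pred)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", formula = y ~ x) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(title = title, x = "Predicted LC50", y = "Residual")
}

p_lm <- residual_plot(cv_predictions(linear_reg()), "Linear regression")
p_tree <- residual_plot(cv_predictions(decision_tree(mode = "regression")),
                        "Decision tree")

# patchwork overloads `+` to place plots side by side.
combined <- p_lm + p_tree
```

With four plots, the same `+` chaining applies, and plot_layout(ncol = 2) from patchwork arranges them in a 2 by 2 grid.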

(1.9) For the boosting model, report the variable importance. You can use the vip package to create a variable importance plot. (1 point - coding)
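A short sketch of a variable importance plot for a boosted tree model. It assumes the xgboost engine and the vip package are installed; the data is simulated and the number of trees is an arbitrary choice.

```r
library(tidymodels)
library(vip)

set.seed(11)

# Simulated stand-in data in which x1 matters most and x3 not at all.
toy <- tibble(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200)) %>%
  mutate(LC50 = 2 * x1 + 0.5 * x2 + rnorm(200, sd = 0.3))

boost_spec <- boost_tree(mode = "regression", trees = 50) %>%
  set_engine("xgboost")

boost_fit <- workflow() %>%
  add_formula(LC50 ~ .) %>%
  add_model(boost_spec) %>%
  fit(data = toy)

# Pull the parsnip fit out of the workflow and plot its importance scores.
importance_plot <- boost_fit %>%
  extract_fit_parsnip() %>%
  vip()
```

Printing importance_plot draws a bar chart with one bar per predictor, ordered by importance.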