DS-6030 Homework Module 8
Module 8 introduced ensemble models. In this homework, we will build various regression models to predict fish toxicity of chemical compounds.
You can download the Quarto Markdown file and use it to answer the following questions. If not otherwise stated, use Tidyverse and Tidymodels for the assignments.
Knitting the assignment can take a very long time without caching and parallel processing. For example, the calculations for problem 1 take more than 16 minutes without parallel execution; parallel processing reduces this to about 5 minutes. With all results cached, re-knitting the document after a text-only change takes only a few seconds. You can find more information about caching and parallel processing in the online material.
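As a rough illustration of the setup described above (a sketch under assumptions, not the course's prescribed configuration): the chunk below registers a parallel backend and marks the chunk as cached. It assumes the `doParallel` package is installed; the worker count and the name `n_workers` are choices made for this example. Recent releases of `tune` also support the `future` package, so check which backend your installed version expects.

```r
#| cache: true
# In a Quarto chunk, the line above is read as a chunk option that caches the results.

library(doParallel)

# Assumption: keep one core free for the rest of the system.
n_cores <- parallel::detectCores()
n_workers <- if (is.na(n_cores)) 1 else max(1, n_cores - 1)

# Start the worker processes and register them as the foreach backend.
cl <- parallel::makePSOCKcluster(n_workers)
doParallel::registerDoParallel(cl)

# ... tuning and resampling calls would go here ...

# Shut the workers down and fall back to sequential execution.
parallel::stopCluster(cl)
foreach::registerDoSEQ()
```

Starting the cluster once near the top of the document and stopping it at the end avoids paying the start-up cost in every chunk.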
You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:
- Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
- Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
- Verify the output: LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.
The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.
1. Predicting fish toxicity of chemicals (regression)
The dataset qsar_fish_toxicity.csv contains information about the toxicity of 908 chemicals to fish. The data was downloaded from the UCI Machine Learning Repository. The dataset contains 6 features and the toxicity of the chemicals (7 columns in total). The features are a variety of molecular structure descriptors. The toxicity is measured as the negative logarithm of the concentration that kills 50% of the fish after 96 hours of exposure.
- CIC0: Information indices
- SM1_Dz(Z): 2D matrix-based descriptors
- GATS1i: 2D autocorrelations
- NdsCH: Atom-type counts
- NdssC: Atom-type counts
- MLOGP: Molecular properties
- LC50: Toxicity towards fish (negative log of the lethal concentration)
- Data loading and preparation:
(1.1) Load the dataset qsar_fish_toxicity.csv. Hint: the dataset has no column headers and its fields are separated by a semicolon; read the documentation for read_delim. (1 point - coding)
(1.2) Split the data into training (80%) and test (20%) sets using stratified sampling on LC50. Prepare the folds for 10-fold cross-validation of all models. (1 point - coding)
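The mechanics of reading a headerless, semicolon-separated file and then splitting it can be sketched as follows. This is an illustration on made-up data, not the assignment's solution: the temporary file stands in for qsar_fish_toxicity.csv, the values are random, and the column name `SM1_Dz` is a simplified spelling chosen here so that the name is syntactic in R. Stratifying the folds as well as the split is an optional choice made for this sketch.

```r
library(tidyverse)
library(tidymodels)

set.seed(123)

# Column names must be supplied by hand because the file has no header line.
col_names <- c("CIC0", "SM1_Dz", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50")

# Hypothetical stand-in data: random values written without a header.
fake <- tibble(
  CIC0 = runif(200, 0, 6),
  SM1_Dz = runif(200, 0, 2),
  GATS1i = runif(200, 0, 3),
  NdsCH = rpois(200, 0.3),
  NdssC = rpois(200, 0.5),
  MLOGP = rnorm(200, 2, 1.5),
  LC50 = rnorm(200, 4, 1.5)
)
path <- tempfile(fileext = ".csv")
write_delim(fake, path, delim = ";", col_names = FALSE)

# Read the file back: semicolon delimiter, names passed explicitly.
fish <- read_delim(path, delim = ";", col_names = col_names,
                   show_col_types = FALSE)

# 80/20 split, stratified on the (binned) outcome.
fish_split <- initial_split(fish, prop = 0.8, strata = LC50)
fish_train <- training(fish_split)
fish_test <- testing(fish_split)

# 10 folds on the training set, reused for every model.
folds <- vfold_cv(fish_train, v = 10, strata = LC50)
```

Creating the folds once and passing the same object to every model keeps the cross-validation results directly comparable.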
- Model training:
(1.3) Build the following models using the training set and evaluate them using 10-fold cross-validation. For tuned models, use the autoplot function to inspect the tuning results and carry out cross-validation with the optimal parameters. (6 points - coding)
- Linear regression
- Random forest (tune min_n and mtry)
- Boosting model (tune learn_rate, trees, min_n and tree_depth)
- k-nearest neighbors (tune neighbors)
(1.4) Report the cross-validation metrics (RMSE and \(r^2\)). What do you observe? Which model would you pick based on these results? (2 points - discussion)
(1.5) Following cross-validation and tuning, fit final models with the optimal parameters to the training set. (1 point - coding)
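The tune, inspect, finalize, and refit sequence in (1.3) to (1.5) follows the same pattern for every tuned model. The sketch below walks through it for a k-nearest neighbors model on made-up data; it is one possible arrangement, not the expected solution. The data, the grid range of 1 to 25 neighbors, and all object names are assumptions for illustration, and the `kknn` package must be installed for the engine to work.

```r
library(tidymodels)

set.seed(1)

# Made-up regression data standing in for the training set.
train <- tibble(x1 = rnorm(150), x2 = rnorm(150)) %>%
  mutate(y = 2 * x1 - x2 + rnorm(150, sd = 0.5))
folds <- vfold_cv(train, v = 10)

# Model specification with the number of neighbors left open for tuning.
knn_spec <- nearest_neighbor(mode = "regression", neighbors = tune()) %>%
  set_engine("kknn")

# Distance-based models are scale sensitive, so predictors are normalized.
knn_rec <- recipe(y ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_spec)

# Evaluate a regular grid of candidate values on the folds.
knn_tune <- tune_grid(
  knn_wf,
  resamples = folds,
  grid = grid_regular(neighbors(range = c(1, 25)), levels = 7),
  metrics = metric_set(rmse, rsq)
)

# Visual check of how the metrics change with the tuning parameter.
autoplot(knn_tune)

# Pick the best candidate by RMSE and plug it into the workflow.
best <- select_best(knn_tune, metric = "rmse")
knn_final_wf <- finalize_workflow(knn_wf, best)

# Cross-validate the finalized workflow, then fit it on the full training set.
knn_cv <- fit_resamples(knn_final_wf, resamples = folds,
                        metrics = metric_set(rmse, rsq))
collect_metrics(knn_cv)

knn_fit <- fit(knn_final_wf, data = train)
```

For models with several tuning parameters, the same steps apply; only the specification and the grid change.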
- Model evaluation:
(1.6) Evaluate the models on the test set and report their performance metrics RMSE and MAE. (1 point - coding/discussion)
(1.7) Create a visualization that compares the RMSE values for training and test sets of the four models. Do you see a difference between the models? Is there an indication of overfitting for any of the models? (2 points - coding/discussion)
(1.8) Create residual plots from the cross-validated predictions for the four models (add a geom_smooth line to each). Combine the four plots in a single figure using the patchwork package. What do you observe? Are there differences in the residuals of the four models? (1 point - coding/discussion)
(1.9) For the boosting model, report the variable importance. You can use the vip package to create a variable importance plot. (1 point - coding)
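For the residual plots in (1.8), the key steps are saving the out-of-fold predictions during resampling, computing residuals from them, and combining the plots with patchwork. A minimal sketch on made-up data follows. It uses two linear models (with and without a squared term) purely so that the example runs without extra engines; the helper `residual_plot()`, the data, and the column names are hypothetical and not part of the assignment.

```r
library(tidymodels)
library(patchwork)

set.seed(2)

# Made-up data with a curved relationship between x and y.
dat <- tibble(x = runif(200, 0, 4)) %>%
  mutate(x_sq = x^2,
         y = 1 + 0.5 * x_sq + rnorm(200, sd = 0.4))
folds <- vfold_cv(dat, v = 10)

# Hypothetical helper: cross-validate a linear model and plot its residuals.
residual_plot <- function(model_formula, title) {
  wf <- workflow() %>%
    add_formula(model_formula) %>%
    add_model(linear_reg())

  # save_pred = TRUE keeps the out-of-fold predictions for each row.
  cv <- fit_resamples(wf, resamples = folds,
                      control = control_resamples(save_pred = TRUE))

  collect_predictions(cv) %>%
    mutate(residual = y - .pred) %>%
    ggplot(aes(x = .pred, y = residual)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(title = title, x = "Predicted", y = "Residual")
}

p_linear <- residual_plot(y ~ x, "Linear in x")
p_quad <- residual_plot(y ~ x + x_sq, "With squared term")

# patchwork overloads `+` to place the plots side by side.
combined <- p_linear + p_quad
combined
```

A smooth line that bends away from zero, as expected here for the model without the squared term, points to structure the model has not captured.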