DS-6030 Homework Module 4

Note

This module introduced cross-validation and bootstrapping as methods to estimate model performance. The homework assignment will give you the opportunity to apply these methods to two datasets. In assignment (1), you will use cross-validation to compare logistic regression, LDA, and QDA classification models. Assignment (2) uses bootstrap to estimate confidence intervals for the mean absolute error and root mean squared error of a linear regression model.

Use the tidyverse and tidymodels packages for the assignments.

You can download the Quarto Markdown file and use it as a basis for your solution.

As you will find out, the knitting times for this assignment will be longer than in previous homework. To speed up the knitting process, use caching and parallel processing. You can find more information about caching and parallel processing in the online material.
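A minimal setup sketch for parallel processing, assuming the doParallel backend (tidymodels resampling functions pick up the registered workers automatically); caching is enabled per chunk in Quarto:

```r
library(doParallel)

# Register one worker per physical core; a hypothetical choice, adjust to your machine
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

# In a Quarto chunk, enable caching with the chunk option:
#| cache: true

# At the end of the document, release the workers:
# stopCluster(cl)
```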

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Diabetes dataset (11 points)

The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey responses, along with each respondent’s diabetes diagnosis. The 35 features consist of demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or is healthy. For this study, the target variable was converted to a binary variable with 1 for diabetic or pre-diabetic and 0 for healthy. Information about the dataset can be found in the data dictionary.

For this exercise use caching and parallel processing to speed up your computations.

  1. Data loading and preprocessing:

(1.1) Load the diabetes dataset (diabetes_binary_5050split_health_indicators_BRFSS2015.csv.gz). Convert Diabetes_binary to a factor with labels Healthy and Diabetes. Convert all other variables that contain only the values 0 and 1 to factors. (1 point - coding)
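A starting-point sketch for the loading and conversion step (the file path is an assumption; adjust it to where you saved the data):

```r
library(tidyverse)

diabetes <- read_csv("diabetes_binary_5050split_health_indicators_BRFSS2015.csv.gz") |>
  mutate(
    # target first: 0/1 -> labeled factor
    Diabetes_binary = factor(Diabetes_binary, levels = c(0, 1),
                             labels = c("Healthy", "Diabetes")),
    # then convert every remaining column that holds only 0s and 1s
    across(where(~ all(.x %in% c(0, 1))), as.factor)
  )
```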

(1.2) Split the data into a training and test set using a 50/50 split. Use the set.seed() function to ensure reproducibility. (1 point - coding)
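A sketch of the split, assuming rsample's `initial_split()`; the seed value is arbitrary, and stratifying on the target keeps the class balance similar in both sets:

```r
library(tidymodels)

set.seed(123)  # any fixed seed works for reproducibility
diabetes_split <- initial_split(diabetes, prop = 0.5, strata = Diabetes_binary)
train_data <- training(diabetes_split)
test_data  <- testing(diabetes_split)
```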

  2. Model training:

(1.3) Build a logistic regression model to predict Diabetes_binary using all other variables as predictors. Use the training set to fit the model using 10-fold cross-validation. Report the cross-validation accuracy and ROC-AUC of the model. (see Machine learning with tidymodels: Model validation using cross-validation) (2 points - coding)
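One possible shape for this step, assuming the split from (1.2); `save_pred = TRUE` keeps the cross-validation predictions, which (1.5) needs for the ROC curves:

```r
set.seed(123)
folds <- vfold_cv(train_data, v = 10, strata = Diabetes_binary)

logreg_wf <- workflow() |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  add_formula(Diabetes_binary ~ .)

logreg_res <- fit_resamples(
  logreg_wf,
  resamples = folds,
  metrics   = metric_set(accuracy, roc_auc),
  control   = control_resamples(save_pred = TRUE)
)
collect_metrics(logreg_res)
```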

(1.4) Use the approach from (1.3) to build LDA and QDA models. Report the cross-validation accuracy and ROC-AUC of each model. (4 points - coding)
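A sketch of the model specifications, assuming the discrim extension package with the MASS engine; each workflow is then fit with `fit_resamples()` on the same folds as in (1.3):

```r
library(discrim)

lda_wf <- workflow() |>
  add_model(discrim_linear() |> set_engine("MASS")) |>
  add_formula(Diabetes_binary ~ .)

qda_wf <- workflow() |>
  add_model(discrim_quad() |> set_engine("MASS")) |>
  add_formula(Diabetes_binary ~ .)
```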

  3. Cross-validation and test set performance:

(1.5) Create a plot that compares the ROC curves of the three models from (1.3) and (1.4). The ROC curve should be based on the predictions from cross-validation.

How do the models compare? Which model would you choose for prediction? (1 point - coding/discussion)
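A plotting sketch, assuming resample result objects named `logreg_res`, `lda_res`, and `qda_res` (hypothetical names) that were fit with `save_pred = TRUE`; since Healthy is the first factor level, `.pred_Healthy` is the event probability yardstick expects by default:

```r
bind_rows(
  collect_predictions(logreg_res) |> mutate(model = "Logistic regression"),
  collect_predictions(lda_res)    |> mutate(model = "LDA"),
  collect_predictions(qda_res)    |> mutate(model = "QDA")
) |>
  group_by(model) |>
  roc_curve(Diabetes_binary, .pred_Healthy) |>
  autoplot()
```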

(1.6) After fitting the three models using the full training set, estimate the performance metrics on the test set. Report the accuracy and ROC-AUC of each model. How do the models compare? Do you see a difference compared to the cross-validation results? (1 point - coding/discussion)
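For this step, `last_fit()` does both parts at once: it refits the workflow on the full training set and evaluates on the test set. A sketch for one model, assuming the workflow and split objects from earlier steps:

```r
logreg_final <- last_fit(
  logreg_wf,
  split   = diabetes_split,
  metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(logreg_final)
# repeat for the LDA and QDA workflows
```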

Note: LLM assignment

(1.7) Ask an LLM: “I’m running cross-validation on a dataset with 71000 rows. Should I use 5-fold, 10-fold, or leave-one-out cross-validation?” without telling it any other details about the problem.

Evaluate the answer:

  • Does the LLM consider sample size, computational cost, and the bias-variance tradeoff specifically? Or does it default to “10 is standard”?
  • What information about your problem (computational budget, target variance of the estimate, dataset structure) would be necessary to actually answer this question? (1 point - discussion)

2. Estimate model performance using bootstrap (7 points)

(2.1) Use the mtcars dataset (see Machine learning with tidymodels: The mtcars dataset) to estimate the mean absolute error (MAE) and root mean squared error (RMSE) of a linear regression model predicting mpg, using the bootstrap. Use 1000 bootstrap samples to refit the model and collect the performance metrics. Report the mean and standard deviation of the two metrics. (see Machine learning with tidymodels: Model validation using bootstrapping) (3 points - coding)
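A bootstrap sketch using rsample's `bootstraps()` with `fit_resamples()`, which evaluates each refit model on that resample's out-of-bag rows; `summarize = FALSE` returns one metric value per bootstrap sample instead of the averaged summary:

```r
set.seed(123)
boots <- bootstraps(mtcars, times = 1000)

lm_wf <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_formula(mpg ~ .)

boot_res <- fit_resamples(lm_wf, resamples = boots,
                          metrics = metric_set(mae, rmse))

boot_metrics <- collect_metrics(boot_res, summarize = FALSE)
boot_metrics |>
  group_by(.metric) |>
  summarize(mean = mean(.estimate), sd = sd(.estimate))
```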

(2.2) Create a plot of the distribution of the performance metrics. Comment on the shape of the distribution. (1 point - coding/discussion)
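One way to show both distributions, assuming a `boot_metrics` tibble with one row per bootstrap sample per metric (`.metric`, `.estimate` columns, as returned by `collect_metrics(..., summarize = FALSE)`):

```r
ggplot(boot_metrics, aes(.estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ .metric, scales = "free_x")
```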

(2.3) Use the performance metrics calculated for the bootstrap samples to estimate the 95% confidence interval for the mean absolute error (MAE) and root mean squared error (RMSE). Report the confidence intervals. (2 points - coding/discussion)
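A percentile-interval sketch under the same `boot_metrics` assumption as above: the 95% CI is taken directly from the 2.5% and 97.5% quantiles of the per-sample metric values.

```r
boot_metrics |>
  group_by(.metric) |>
  summarize(
    ci_lower = quantile(.estimate, 0.025),
    ci_upper = quantile(.estimate, 0.975)
  )
```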

Note: LLM assignment

(2.4) Paste your bootstrap 95% CI for the RMSE from (2.3) into an LLM and ask: “What does this confidence interval mean? If I deploy this model, can I expect any individual prediction’s error to fall in this range?”

Evaluate the LLM’s answer:

  • Does it correctly distinguish between a CI for the expected RMSE (averaged over many predictions) and a CI for individual prediction errors?
  • Does it correctly tie the CI back to the bootstrap procedure (resampling rows from the training data), or treat it as a generic frequentist CI?
  • Rewrite any incorrect or sloppy interpretation in your own words.

(1 point - discussion)