DS-6030 Homework Module 4

Note

This module introduced cross-validation and bootstrapping as methods to estimate model performance. The homework assignment will give you the opportunity to apply these methods to two datasets. In assignment (1), you will use cross-validation to compare logistic regression, LDA, and QDA classification models. Assignment (2) uses bootstrap to estimate confidence intervals for the mean absolute error and root mean squared error of a linear regression model.

Use the tidyverse and tidymodels packages for the assignments.

You can download the Quarto Markdown file and use it as a basis for your solution.

As you will find out, the knitting times for this assignment will be longer than in previous homework. To speed up the knitting process, use caching and parallel processing. You can find more information about caching and parallel processing in the online material.
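A minimal setup sketch for parallel processing, assuming the doParallel backend (tidymodels resampling functions pick up the registered workers automatically); caching is enabled per chunk in Quarto:

```r
library(doParallel)

# Register one worker per physical core; a hypothetical choice, adjust to your machine
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

# In a Quarto chunk, enable caching with the chunk option:
#| cache: true

# At the end of the document, release the workers:
# stopCluster(cl)
```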

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Diabetes dataset (11 points)

The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey responses, along with each respondent’s diabetes diagnosis. The 35 features consist of demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or is healthy. For this study, the target variable was converted to a binary variable with 1 for diabetic or pre-diabetic and 0 for healthy. Information about the dataset can be found in the data dictionary.

For this exercise use caching and parallel processing to speed up your computations.

  1. Data loading and preprocessing:

(1.1) Load the diabetes dataset (diabetes_binary_5050split_health_indicators_BRFSS2015.csv.gz). Convert Diabetes_binary to a factor with labels Healthy and Diabetes. Convert all other variables that contain only the values 0 and 1 to factors. (1 point - coding)
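A starting-point sketch for the loading and conversion step (the file path is an assumption; adjust it to where you saved the data):

```r
library(tidyverse)

diabetes <- read_csv("diabetes_binary_5050split_health_indicators_BRFSS2015.csv.gz") |>
  mutate(
    # target first: 0/1 -> labeled factor
    Diabetes_binary = factor(Diabetes_binary, levels = c(0, 1),
                             labels = c("Healthy", "Diabetes")),
    # then convert every remaining column that holds only 0s and 1s
    across(where(~ all(.x %in% c(0, 1))), as.factor)
  )
```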

(1.2) Split the data into a training and test set using a 50/50 split. Use the set.seed() function to ensure reproducibility. (1 point - coding)
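A sketch of the split, assuming rsample's `initial_split()`; the seed value is arbitrary, and stratifying on the target keeps the class balance similar in both sets:

```r
library(tidymodels)

set.seed(123)  # any fixed seed works for reproducibility
diabetes_split <- initial_split(diabetes, prop = 0.5, strata = Diabetes_binary)
train_data <- training(diabetes_split)
test_data  <- testing(diabetes_split)
```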

  2. Model training:

(1.3) Build a logistic regression model to predict Diabetes_binary using all other variables as predictors. Use the training set to fit the model using 10-fold cross-validation. Report the cross-validation accuracy and ROC-AUC of the model. (see Machine learning with tidymodels: Model validation using cross-validation) (2 points - coding)
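One possible shape for this step, assuming the split from (1.2); `save_pred = TRUE` keeps the cross-validation predictions, which (1.5) needs for the ROC curves:

```r
set.seed(123)
folds <- vfold_cv(train_data, v = 10, strata = Diabetes_binary)

logreg_wf <- workflow() |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  add_formula(Diabetes_binary ~ .)

logreg_res <- fit_resamples(
  logreg_wf,
  resamples = folds,
  metrics   = metric_set(accuracy, roc_auc),
  control   = control_resamples(save_pred = TRUE)
)
collect_metrics(logreg_res)
```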

(1.4) Use the approach from (1.3) to build LDA and QDA models. Report the cross-validation accuracy and ROC-AUC of each model. (4 points - coding)
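A sketch of the model specifications, assuming the discrim extension package with the MASS engine; each workflow is then fit with `fit_resamples()` on the same folds as in (1.3):

```r
library(discrim)

lda_wf <- workflow() |>
  add_model(discrim_linear() |> set_engine("MASS")) |>
  add_formula(Diabetes_binary ~ .)

qda_wf <- workflow() |>
  add_model(discrim_quad() |> set_engine("MASS")) |>
  add_formula(Diabetes_binary ~ .)
```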

  3. Cross-validation and test set performance:

(1.5) Create a plot that compares the ROC curves of the three models from (1.3) and (1.4). The ROC curve should be based on the predictions from cross-validation.

How do the models compare? Which model would you choose for prediction? (1 point - coding/discussion)
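A plotting sketch, assuming resample result objects named `logreg_res`, `lda_res`, and `qda_res` (hypothetical names) that were fit with `save_pred = TRUE`; since Healthy is the first factor level, `.pred_Healthy` is the event probability yardstick expects by default:

```r
bind_rows(
  collect_predictions(logreg_res) |> mutate(model = "Logistic regression"),
  collect_predictions(lda_res)    |> mutate(model = "LDA"),
  collect_predictions(qda_res)    |> mutate(model = "QDA")
) |>
  group_by(model) |>
  roc_curve(Diabetes_binary, .pred_Healthy) |>
  autoplot()
```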

(1.6) After fitting the three models using the full training set, estimate the performance metrics on the test set. Report the accuracy and ROC-AUC of each model. How do the models compare? Do you see a difference compared to the cross-validation results? (1 point - coding/discussion)
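For this step, `last_fit()` does both parts at once: it refits the workflow on the full training set and evaluates on the test set. A sketch for one model, assuming the workflow and split objects from earlier steps:

```r
logreg_final <- last_fit(
  logreg_wf,
  split   = diabetes_split,
  metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(logreg_final)
# repeat for the LDA and QDA workflows
```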

Note: LLM assignment

(1.7) Ask an LLM: “I’m running cross-validation on a dataset with 71000 rows. Should I use 5-fold, 10-fold, or leave-one-out cross-validation?” without telling it any other details about the problem.

Evaluate the answer:

  • Does the LLM consider sample size, computational cost, and the bias-variance tradeoff specifically? Or does it default to “10 is standard”?
  • What information about your problem (computational budget, target variance of the estimate, dataset structure) would be necessary to actually answer this question? (1 point - discussion)

2. Estimate model performance using bootstrap (7 points)

(2.1) Use the mtcars dataset (see Machine learning with tidymodels: The mtcars dataset) to estimate the mean absolute error (MAE) and root mean squared error (RMSE) of a linear regression model predicting mpg, using the bootstrap. Use 1000 bootstrap samples to refit the model and collect the performance metrics. Report the mean and standard deviation of the two metrics. (see Machine learning with tidymodels: Model validation using bootstrapping) (3 points - coding)
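A bootstrap sketch using rsample's `bootstraps()` with `fit_resamples()`, which evaluates each refit model on that resample's out-of-bag rows; `summarize = FALSE` returns one metric value per bootstrap sample instead of the averaged summary:

```r
set.seed(123)
boots <- bootstraps(mtcars, times = 1000)

lm_wf <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_formula(mpg ~ .)

boot_res <- fit_resamples(lm_wf, resamples = boots,
                          metrics = metric_set(mae, rmse))

boot_metrics <- collect_metrics(boot_res, summarize = FALSE)
boot_metrics |>
  group_by(.metric) |>
  summarize(mean = mean(.estimate), sd = sd(.estimate))
```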

(2.2) Create a plot of the distribution of the performance metrics. Comment on the shape of the distribution. (1 point - coding/discussion)
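One way to show both distributions, assuming a `boot_metrics` tibble with one row per bootstrap sample per metric (`.metric`, `.estimate` columns, as returned by `collect_metrics(..., summarize = FALSE)`):

```r
ggplot(boot_metrics, aes(.estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ .metric, scales = "free_x")
```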

(2.3) Use the performance metrics calculated for the bootstrap samples to estimate the 95% confidence interval for the mean absolute error (MAE) and root mean squared error (RMSE). Report the confidence intervals. (2 points - coding/discussion)
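A percentile-interval sketch under the same `boot_metrics` assumption as above: the 95% CI is taken directly from the 2.5% and 97.5% quantiles of the per-sample metric values.

```r
boot_metrics |>
  group_by(.metric) |>
  summarize(
    ci_lower = quantile(.estimate, 0.025),
    ci_upper = quantile(.estimate, 0.975)
  )
```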

Note: LLM assignment

(2.4) Paste your bootstrap 95% CI for the RMSE from (2.3) into an LLM and ask: “What does this confidence interval mean? If I deploy this model, can I expect any individual prediction’s error to fall in this range?”

Evaluate the LLM’s answer:

  • Does it correctly distinguish between a CI for the expected RMSE (averaged over many predictions) and a CI for individual prediction errors?
  • Does it correctly tie the CI back to the bootstrap procedure (resampling rows from the training data), or treat it as a generic frequentist CI?
  • Rewrite any incorrect or sloppy interpretation in your own words.

(1 point - discussion)