DS-6030 Homework Module 3

Note

In this module, we learned about classification models. Assignment (1) tests your understanding of the differences between LDA and QDA. In assignment (2), you will build several classification models for the NASA Asteroid dataset and estimate their predictive performance using a holdout/test set. Assignment (3) prepares you for the final project by researching approaches to handling class imbalance in classification problems.

Use the tidyverse and tidymodels packages for the assignments.

You can download the Quarto Markdown file and use it as a basis for your solution.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what you used it for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

library(tidyverse)
library(tidymodels)

1. Differences between LDA and QDA. (5 points)

(1.1) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set? (1 point - discussion)

(1.2) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set? (1 point - discussion)

(1.3) In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why? (1 point - discussion)

(1.4) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer. (1 point - discussion)
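As background for (1.1)–(1.4), recall that the two methods differ only in their covariance assumption, which is what makes LDA’s boundary linear and QDA’s quadratic. These are the standard discriminant functions from the course text; they are a reference, not part of the assignment:

\[
\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k \qquad \text{(LDA, shared } \Sigma \text{: linear in } x\text{)}
\]

\[
\delta_k(x) = -\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log\lvert\Sigma_k\rvert + \log \pi_k \qquad \text{(QDA, class-specific } \Sigma_k \text{: quadratic in } x\text{)}
\]

Because QDA estimates a separate \(p \times p\) covariance matrix for each of the \(K\) classes, it fits roughly \(K\,p(p+1)/2\) covariance parameters where LDA fits \(p(p+1)/2\). This difference in parameter count is where the bias-variance argument in (1.3) and (1.4) comes from.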

Note: LLM assignment

(1.5) Submit (1.4) to an LLM verbatim. Ask for True/False plus justification. Compare with your own answer:

If the LLM agrees with you, evaluate whether its justification appeals to the bias-variance tradeoff specifically, or to vaguer “flexibility” language. If the LLM disagrees, decide who is right and explain why. (1 point - discussion)

2. NASA: Asteroid classification (11 points)

The dataset nasa.csv contains information about asteroids and whether or not they are considered hazardous.

  1. Data loading and preprocessing:

(2.1) Load the data from https://gedeck.github.io/DS-6030/datasets/nasa.csv and preprocess it. Use the code below, which we developed in class, as your preprocessing pipeline. (1 point - coding)

filename <- "https://gedeck.github.io/DS-6030/datasets/nasa.csv"
# columns removed during the in-class preprocessing
remove_columns <- c("Name", "Est Dia in M(min)",
  "Semi Major Axis", "Jupiter Tisserand Invariant",
  "Epoch Osculation", "Mean Motion", "Aphelion Dist",
  "Equinox", "Orbiting Body", "Orbit Determination Date",
  "Close Approach Date", "Epoch Date Close Approach",
  "Miss Dist.(Astronomical)", "Miles per hour")
asteroids <- read_csv(filename, show_col_types = FALSE) |>
  select(-all_of(remove_columns)) |>
  # drop further columns by name pattern (redundant units and unused fields)
  select(-contains("Relative Velocity")) |>
  select(-contains("Est Dia in KM")) |>
  select(-contains("Est Dia in Feet")) |>
  select(-contains("Est Dia in Miles")) |>
  select(-contains("Miss Dist.(lunar)")) |>
  select(-contains("Miss Dist.(kilometers)")) |>
  select(-contains("Miss Dist.(miles)")) |>
  distinct() |>                               # remove duplicate rows
  mutate(Hazardous = as.factor(Hazardous))    # tidymodels needs a factor outcome
dim(asteroids)
[1] 3692   15

(2.2) Split the dataset into a training and test set. Use 80% of the data for training and 20% for testing. Use stratified sampling to ensure that the training and test set have the same proportion of hazardous asteroids. (2 points - coding)
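A minimal sketch, assuming the preprocessed data frame is named asteroids as in (2.1); the seed value is arbitrary:

set.seed(6030)   # arbitrary seed, for reproducibility
# strata = Hazardous keeps the proportion of hazardous asteroids
# the same in the training and test partitions
data_split <- initial_split(asteroids, prop = 0.8, strata = Hazardous)
train_data <- training(data_split)
test_data  <- testing(data_split)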

  2. Model training. Build classification models with tidymodels using four different methods (null model, logistic regression, LDA, and QDA):

(2.3) Use the training set to fit the four models. (2 points - coding)
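One possible setup, sketched under the assumption that the discrim package (which supplies the LDA/QDA engines) is installed and that the split from (2.2) produced train_data; the object names are illustrative, not prescribed:

library(discrim)   # discrim_linear() and discrim_quad() live here, not in parsnip

specs <- list(
  null     = null_model() |> set_engine("parsnip") |> set_mode("classification"),
  logistic = logistic_reg() |> set_engine("glm"),
  lda      = discrim_linear() |> set_engine("MASS"),
  qda      = discrim_quad() |> set_engine("MASS")
)

# fit every specification with the same formula on the training set
fits <- map(specs, \(spec) {
  workflow() |>
    add_model(spec) |>
    add_formula(Hazardous ~ .) |>
    fit(train_data)
})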

(2.4) For each model, determine and plot the ROC curves for both the training and test sets. Compare the training and test ROC curves for each model: is there any sign of overfitting? Do you observe differences between the models? Use patchwork to arrange the individual graphs for the four models in a single figure. (1 point - coding/discussion)
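A sketch of one way to build the figure. It assumes the fits list above and that Hazardous has levels FALSE/TRUE (as produced by as.factor on a logical column), so augment() adds a .pred_TRUE column and the event of interest is the second factor level; verify with levels(train_data$Hazardous):

library(patchwork)

# ROC curves for one fitted model, train and test overlaid
roc_plot <- function(fit, name) {
  bind_rows(
    augment(fit, train_data) |> mutate(set = "train"),
    augment(fit, test_data)  |> mutate(set = "test")
  ) |>
    group_by(set) |>
    roc_curve(Hazardous, .pred_TRUE, event_level = "second") |>
    autoplot() +
    ggtitle(name)
}

# one panel per model, arranged in a 2x2 grid
wrap_plots(imap(fits, roc_plot), ncol = 2)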

(2.5) Create a single plot that overlays the ROC curves of the four models for the test set. Which model separates the two classes best? (1 point - coding/discussion)
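For the overlay, one approach (same assumptions about .pred_TRUE and the event level as above) is to compute each test-set curve with a model label and bind them:

test_roc <- imap(fits, \(fit, name) {
  augment(fit, test_data) |>
    roc_curve(Hazardous, .pred_TRUE, event_level = "second") |>
    mutate(model = name)
}) |>
  list_rbind()

ggplot(test_roc, aes(1 - specificity, sensitivity, color = model)) +
  geom_path() +
  geom_abline(linetype = "dashed") +   # chance diagonal
  coord_equal()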

(2.6) For each model, determine the threshold that maximizes the F-measure (yardstick::f_meas) using the training set. Why is the F-measure a better metric than accuracy in this case? Hint: determine the class proportions. Create plots that show the dependence of the F-measure on the threshold. (2 points - coding/discussion)
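A manual threshold sweep for a single model, sketched here for the logistic regression fit under the same factor-level assumption; map over all four fits for the full answer (the probably package also offers packaged alternatives if you prefer):

thresholds <- seq(0.05, 0.95, by = 0.01)
preds <- augment(fits$logistic, train_data)

f_by_threshold <- map(thresholds, \(t) {
  preds |>
    mutate(.pred_class = factor(ifelse(.pred_TRUE >= t, "TRUE", "FALSE"),
                                levels = levels(Hazardous))) |>
    f_meas(Hazardous, .pred_class, event_level = "second") |>
    mutate(threshold = t)
}) |>
  list_rbind()

best <- slice_max(f_by_threshold, .estimate, n = 1)   # threshold with highest F-measure

ggplot(f_by_threshold, aes(threshold, .estimate)) +
  geom_line() +
  geom_vline(xintercept = best$threshold, linetype = "dashed") +
  labs(y = "F-measure (training set)")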

(2.7) Determine the accuracy, sensitivity, specificity, and F-measure for each model at the determined thresholds. Which model performs best? How does this compare to the result from the ROC curves? (1 point - coding/discussion)
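With a threshold in hand from (2.6), a yardstick metric set keeps the comparison compact. A sketch for one model on the test set; best$threshold comes from the sweep above:

cls_metrics <- metric_set(accuracy, sensitivity, specificity, f_meas)

augment(fits$logistic, test_data) |>
  mutate(.pred_class = factor(ifelse(.pred_TRUE >= best$threshold, "TRUE", "FALSE"),
                              levels = levels(Hazardous))) |>
  cls_metrics(truth = Hazardous, estimate = .pred_class, event_level = "second")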

Note: LLM assignment

(2.8) Without giving the LLM access to your data or class proportions, ask: “I’m building a binary classifier. What probability threshold should I use to convert predicted probabilities into class labels?” Paste the LLM’s recommendation.

Compare with the data-driven threshold you found in (2.6). What did the LLM assume? On a class-imbalanced problem like this one, when would following the LLM’s generic advice produce a misleading model? (1 point - discussion)

3. Handling class imbalance in classification problems (4 points)

Write a short essay (about half a page) on handling class imbalance in classification problems. Here are a few questions to guide your research:

  • What is class imbalance, and why is it a problem for classification?
  • What are common strategies to handle class imbalance?
  • What are appropriate evaluation metrics for imbalanced classification problems?

Consult 2 or more sources for your research. You can use the following resource to get started:

  • https://machinelearningmastery.com/what-is-imbalanced-classification/

Don’t forget to reference your sources, including any use of large language models such as ChatGPT.