DS-6030 Homework Module 3

Note

In this module, we learned about classification models. Assignment (1) tests your understanding of the differences between LDA and QDA. In assignment (2), you will build several classification models for the NASA Asteroid dataset and estimate their predictive performance using a holdout/test set. Assignment (3) prepares you for the final project by researching approaches to handling class imbalance in classification problems.

Use the tidyverse and tidymodels packages for the assignments.

You can download the Quarto Markdown file and use it as a basis for your solution.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what you used it for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

library(tidyverse)
library(tidymodels)

1. Differences between LDA and QDA. (5 points)

(1.1) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set? (1 point - discussion)

(1.2) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set? (1 point - discussion)

(1.3) In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why? (1 point - discussion)

(1.4) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer. (1 point - discussion)
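As background for (1.1)–(1.4), recall that the two methods differ only in their covariance assumption, which is what makes LDA’s boundary linear and QDA’s quadratic. These are the standard discriminant functions from the course text; they are a reference, not part of the assignment:

\[
\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k \qquad \text{(LDA, shared } \Sigma \text{: linear in } x\text{)}
\]

\[
\delta_k(x) = -\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log\lvert\Sigma_k\rvert + \log \pi_k \qquad \text{(QDA, class-specific } \Sigma_k \text{: quadratic in } x\text{)}
\]

Because QDA estimates a separate \(p \times p\) covariance matrix for each of the \(K\) classes, it fits roughly \(K\,p(p+1)/2\) covariance parameters where LDA fits \(p(p+1)/2\). This difference in parameter count is where the bias-variance argument in (1.3) and (1.4) comes from.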

Note: LLM assignment

(1.5) Submit (1.4) to an LLM verbatim. Ask for True/False plus justification. Compare with your own answer:

If the LLM agrees with you, evaluate whether its justification appeals to the bias-variance tradeoff specifically, or to vaguer “flexibility” language. If the LLM disagrees, decide who is right and explain why. (1 point - discussion)

2. NASA: Asteroid classification (11 points)

The dataset nasa.csv contains information about asteroids and whether or not they are considered hazardous.

  1. Data loading and preprocessing:

(2.1) Load the data from https://gedeck.github.io/DS-6030/datasets/nasa.csv and preprocess it. Use the code below, which we developed in class, as your preprocessing pipeline. (1 point - coding)

filename <- "https://gedeck.github.io/DS-6030/datasets/nasa.csv"
# columns removed during the in-class preprocessing
remove_columns <- c("Name", "Est Dia in M(min)",
  "Semi Major Axis", "Jupiter Tisserand Invariant",
  "Epoch Osculation", "Mean Motion", "Aphelion Dist",
  "Equinox", "Orbiting Body", "Orbit Determination Date",
  "Close Approach Date", "Epoch Date Close Approach",
  "Miss Dist.(Astronomical)", "Miles per hour")
asteroids <- read_csv(filename, show_col_types = FALSE) |>
  select(-all_of(remove_columns)) |>
  # drop further columns by name pattern (redundant units and unused fields)
  select(-contains("Relative Velocity")) |>
  select(-contains("Est Dia in KM")) |>
  select(-contains("Est Dia in Feet")) |>
  select(-contains("Est Dia in Miles")) |>
  select(-contains("Miss Dist.(lunar)")) |>
  select(-contains("Miss Dist.(kilometers)")) |>
  select(-contains("Miss Dist.(miles)")) |>
  distinct() |>                               # remove duplicate rows
  mutate(Hazardous = as.factor(Hazardous))    # tidymodels needs a factor outcome
dim(asteroids)
[1] 3692   15

(2.2) Split the dataset into a training and test set. Use 80% of the data for training and 20% for testing. Use stratified sampling to ensure that the training and test set have the same proportion of hazardous asteroids. (2 points - coding)
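A minimal sketch, assuming the preprocessed data frame is named asteroids as in (2.1); the seed value is arbitrary:

set.seed(6030)   # arbitrary seed, for reproducibility
# strata = Hazardous keeps the proportion of hazardous asteroids
# the same in the training and test partitions
data_split <- initial_split(asteroids, prop = 0.8, strata = Hazardous)
train_data <- training(data_split)
test_data  <- testing(data_split)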

  2. Model training. Build classification models with tidymodels using four different methods (null model, logistic regression, LDA, and QDA):

(2.3) Use the training set to fit the four models. (2 points - coding)
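One possible setup, sketched under the assumption that the discrim package (which supplies the LDA/QDA engines) is installed and that the split from (2.2) produced train_data; the object names are illustrative, not prescribed:

library(discrim)   # discrim_linear() and discrim_quad() live here, not in parsnip

specs <- list(
  null     = null_model() |> set_engine("parsnip") |> set_mode("classification"),
  logistic = logistic_reg() |> set_engine("glm"),
  lda      = discrim_linear() |> set_engine("MASS"),
  qda      = discrim_quad() |> set_engine("MASS")
)

# fit every specification with the same formula on the training set
fits <- map(specs, \(spec) {
  workflow() |>
    add_model(spec) |>
    add_formula(Hazardous ~ .) |>
    fit(train_data)
})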

(2.4) For each model, determine and plot the ROC curves for both the training and test sets. Compare the training and test ROC curves for each model: is there any sign of overfitting? Do you observe differences between the models? Use patchwork to arrange the individual graphs for the four models in a single figure. (1 point - coding/discussion)
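A sketch of one way to build the figure. It assumes the fits list above and that Hazardous has levels FALSE/TRUE (as produced by as.factor on a logical column), so augment() adds a .pred_TRUE column and the event of interest is the second factor level; verify with levels(train_data$Hazardous):

library(patchwork)

# ROC curves for one fitted model, train and test overlaid
roc_plot <- function(fit, name) {
  bind_rows(
    augment(fit, train_data) |> mutate(set = "train"),
    augment(fit, test_data)  |> mutate(set = "test")
  ) |>
    group_by(set) |>
    roc_curve(Hazardous, .pred_TRUE, event_level = "second") |>
    autoplot() +
    ggtitle(name)
}

# one panel per model, arranged in a 2x2 grid
wrap_plots(imap(fits, roc_plot), ncol = 2)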

(2.5) Create a single plot that overlays the ROC curves of the four models for the test set. Which model separates the two classes best? (1 point - coding/discussion)
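For the overlay, one approach (same assumptions about .pred_TRUE and the event level as above) is to compute each test-set curve with a model label and bind them:

test_roc <- imap(fits, \(fit, name) {
  augment(fit, test_data) |>
    roc_curve(Hazardous, .pred_TRUE, event_level = "second") |>
    mutate(model = name)
}) |>
  list_rbind()

ggplot(test_roc, aes(1 - specificity, sensitivity, color = model)) +
  geom_path() +
  geom_abline(linetype = "dashed") +   # chance diagonal
  coord_equal()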

(2.6) For each model, determine the threshold that maximizes the F-measure (yardstick::f_meas) using the training set. Why is the F-measure a better metric than accuracy in this case? Hint: determine the class proportions. Create plots that show the dependence of the F-measure on the threshold. (2 points - coding/discussion)
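A manual threshold sweep for a single model, sketched here for the logistic regression fit under the same factor-level assumption; map over all four fits for the full answer (the probably package also offers packaged alternatives if you prefer):

thresholds <- seq(0.05, 0.95, by = 0.01)
preds <- augment(fits$logistic, train_data)

f_by_threshold <- map(thresholds, \(t) {
  preds |>
    mutate(.pred_class = factor(ifelse(.pred_TRUE >= t, "TRUE", "FALSE"),
                                levels = levels(Hazardous))) |>
    f_meas(Hazardous, .pred_class, event_level = "second") |>
    mutate(threshold = t)
}) |>
  list_rbind()

best <- slice_max(f_by_threshold, .estimate, n = 1)   # threshold with highest F-measure

ggplot(f_by_threshold, aes(threshold, .estimate)) +
  geom_line() +
  geom_vline(xintercept = best$threshold, linetype = "dashed") +
  labs(y = "F-measure (training set)")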

(2.7) Determine the accuracy, sensitivity, specificity, and F-measure for each model at the determined thresholds. Which model performs best? How does this compare to the result from the ROC curves? (1 point - coding/discussion)
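With a threshold in hand from (2.6), a yardstick metric set keeps the comparison compact. A sketch for one model on the test set; best$threshold comes from the sweep above:

cls_metrics <- metric_set(accuracy, sensitivity, specificity, f_meas)

augment(fits$logistic, test_data) |>
  mutate(.pred_class = factor(ifelse(.pred_TRUE >= best$threshold, "TRUE", "FALSE"),
                              levels = levels(Hazardous))) |>
  cls_metrics(truth = Hazardous, estimate = .pred_class, event_level = "second")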

Note: LLM assignment

(2.8) Without giving the LLM access to your data or class proportions, ask: “I’m building a binary classifier. What probability threshold should I use to convert predicted probabilities into class labels?” Paste the LLM’s recommendation.

Compare with the data-driven threshold you found in (2.6). What did the LLM assume? On a class-imbalanced problem like this one, when would following the LLM’s generic advice produce a misleading model? (1 point - discussion)

3. Handling class imbalance in classification problems (4 points)

Write a short essay (about half a page) on handling class imbalance in classification problems. Here are a few questions to guide your research:

  • What is class imbalance, and why is it a problem for classification?
  • What are common strategies to handle class imbalance?
  • What are appropriate evaluation metrics for imbalanced classification problems?

Consult 2 or more sources for your research. You can use the following resource to get started:

  • https://machinelearningmastery.com/what-is-imbalanced-classification/

Don’t forget to reference your sources, including any use of large language models such as ChatGPT.