DS-6030 Homework Module 5

Note

In module 5, we learned about tuning model hyperparameters. In this homework, we use L1/L2 penalties and dimensionality reduction with PCA and PLS to control the flexibility of our models and reduce the risk of overfitting. We will apply these techniques to build classification and regression models. Problems 2 and 3 use datasets from previous assignments, so you can reuse the preprocessing steps you developed there.

You can download the Quarto Markdown file and use it to answer the following questions.

If not otherwise stated, use the tidyverse and tidymodels packages for the assignments.

As you will find out, rendering times for the assignment will get longer as you add more code. To speed up the rendering process, use caching and parallel processing. You can find more information about both in the online material.
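As a minimal sketch of both speed-ups (the chunk option and package names follow standard Quarto and tidymodels usage; adapt them to your setup):

```r
# In a Quarto code chunk, enable caching with a chunk option:
#| cache: true

# Register a parallel backend so tuning functions can
# evaluate resamples on multiple cores:
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

# ... run your tuning code here; tidymodels picks up the
# registered backend automatically ...

stopCluster(cl)  # release the workers when done
```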

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Regularization concepts (2 points)

(1.1) Why is step_normalize typically applied before L1/L2 regularization? What goes wrong if you regularize without standardizing predictors first? (1 point - discussion)

Note: LLM assignment

(1.2) Now consider when standardization is not appropriate. Ask an LLM: “When should I not apply step_normalize before fitting a model?”

Evaluate the LLM’s answer:

  • Does it identify specific cases where normalization is unnecessary or harmful (e.g., tree-based models that are scale-invariant, dummy / indicator predictors after step_dummy, sparse features where centering destroys sparsity)?
  • Or does it give vague advice without concrete examples?

Add at least one example to its list in your own words, with a short justification.

(1 point - discussion)

2. Build an elastic net model for predicting airfare prices (L1/L2 regularization)

The Airfares.csv.gz dataset was already used in problem 1 of module 2, where we predicted the price of an airline ticket (FARE) with a plain linear regression model. In this assignment, we will model the same target with an elastic net: a linear regression with both L1 and L2 regularization (tuning parameters mixture and penalty).

(2.1) Load and preprocess the data. Reuse the preprocessing steps you developed in module 2. (1 point - coding)

(2.2) Split the data into a training (75%) and test set (25%). Prepare the resamples for 10-fold cross-validation using the training set. (1 point - coding)
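A minimal sketch of the split and resamples (the data frame name `airfares` and the seed are assumptions; use your own object names):

```r
library(tidymodels)
set.seed(123)  # any fixed seed makes the split reproducible

# 75/25 train/test split of the (already preprocessed) airfare data
airfare_split <- initial_split(airfares, prop = 0.75)
airfare_train <- training(airfare_split)
airfare_test  <- testing(airfare_split)

# 10-fold cross-validation resamples from the training set only
airfare_folds <- vfold_cv(airfare_train, v = 10)
```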

(2.3) Define the workflow and tunable parameters. In the recipe, include a step to convert the categorical / nominal variables to dummy variables (step_dummy(all_nominal_predictors())). (1 point - coding)
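One way to set this up (object names such as `airfare_train` are assumptions; the glmnet engine is the usual choice for elastic net in tidymodels):

```r
# Elastic net: penalty (lambda) and mixture (alpha) are left for tuning
enet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# Recipe: dummy-code nominal predictors, then normalize
# (glmnet is sensitive to predictor scale)
enet_rec <- recipe(FARE ~ ., data = airfare_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

enet_wf <- workflow() |>
  add_recipe(enet_rec) |>
  add_model(enet_spec)
```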

(2.4) Tune the model with 10-fold cross-validation using Bayesian hyperparameter optimization. Make sure that your search space covers a suitable range of values by inspecting the autoplot of the tuning results. (see Machine learning with tidymodels: Bayesian Hyperparameter optimization) (2 points - coding)
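A sketch of the Bayesian search (the `initial` and `iter` values are illustrative starting points, not required settings):

```r
set.seed(456)
enet_bayes <- tune_bayes(
  enet_wf,
  resamples = airfare_folds,
  initial   = 10,   # size of the initial grid before Bayesian search
  iter      = 25,   # number of Bayesian optimization iterations
  metrics   = metric_set(rmse, rsq)
)

# Inspect whether the search explored a suitable range of values
autoplot(enet_bayes)
show_best(enet_bayes, metric = "rmse")
```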

(2.5) Train a final model using the best parameter set. (1 point - coding)

(2.6) Predict FARE for the test set and calculate the performance metrics on the test set. (1 point - coding)
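Steps (2.5) and (2.6) can be sketched together with last_fit(), which trains on the full training set and evaluates on the test set in one call (object names continue the assumed names above):

```r
best_params <- select_best(enet_bayes, metric = "rmse")

final_wf  <- finalize_workflow(enet_wf, best_params)
final_fit <- last_fit(final_wf, split = airfare_split)

collect_metrics(final_fit)       # test-set RMSE and R^2
collect_predictions(final_fit)   # predicted FARE for the test set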

3. NASA asteroid classification with dimensionality reduction

The dataset nasa.csv contains information about asteroids and whether they are considered hazardous.

  1. Data loading and preprocessing:

(3.1) Load the NASA data and preprocess them. You can find the necessary preprocessing steps in the assignment for module 3. (1 point - coding)

(3.2) Split the dataset into a training and test set. Use 80% of the data for training and 20% for testing. Use stratified sampling to ensure that the training and test set have the same proportion of hazardous asteroids. (1 point - coding)
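A sketch of the stratified split (the data frame name `nasa` and the outcome column `hazardous` are assumptions; use the names from your module 3 preprocessing):

```r
set.seed(789)
nasa_split <- initial_split(nasa, prop = 0.80, strata = hazardous)
nasa_train <- training(nasa_split)
nasa_test  <- testing(nasa_split)

# Stratification keeps the class proportions similar in both sets:
nasa_train |> count(hazardous) |> mutate(p = n / sum(n))
nasa_test  |> count(hazardous) |> mutate(p = n / sum(n))
```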

  2. Model building:

(3.3) Build a logistic regression classification model using principal components (step_pca) as predictors. Use cross-validation to determine the optimal number of components. (see Machine learning with tidymodels: Specifying tunable parameters) (3 points - coding)

  • Use step_normalize and step_pca to preprocess the data in a recipe.
  • Use the tune function to find the best number of components (num_comp) using AUC as the selection criterion. Check every value from 1 to 14 (the predictor count after the preprocessing steps).
  • Use the autoplot function on the cross-validation results to visualize the results. Describe your observations.
  • Report the best number of components and the associated classification metrics.
  • Using the best parameter set, train a final model using the full training set and determine the performance metrics on the test set.

(3.4) Repeat (3.3) using PLS (step_pls) in the preprocessing of the predictors. The model is still a logistic regression classifier. (See Machine learning with tidymodels: Partial least squares regression for how to install the required packages.) (3 points - coding)

  • Use step_normalize and step_pls to preprocess the data in a recipe.
  • Use the tune function to find the best number of components (num_comp) using AUC as the selection criterion. Check every value from 1 to 14 (the predictor count after the preprocessing steps).
  • Use the autoplot function on the cross-validation results to visualize the results. Describe your observations.
  • Report the best number of components and the associated classification metrics.
  • Using the best parameter set, train a final model using the full training set and determine the performance metrics on the test set.
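Compared with (3.3), only the dimensionality-reduction step changes. Because PLS is supervised, step_pls additionally needs the outcome (it also requires the mixOmics package to be installed; names again assumed from the split step):

```r
pls_rec <- recipe(hazardous ~ ., data = nasa_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_pls(all_numeric_predictors(),
           outcome = "hazardous",    # PLS uses the response
           num_comp = tune())
```

The rest of the workflow, tuning grid, and evaluation can stay identical to the PCA version.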

Note: LLM assignment

(3.5) Create a figure that combines the autoplots of the tuning results from (3.3) and (3.4). Do you see different behavior in the two autoplot graphs? What do you think is going on? Could you reduce the number of components further than what CV suggests?
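One way to combine the two plots, assuming tuning results named `pca_res` and `pls_res` and the patchwork package installed:

```r
library(patchwork)

# autoplot() returns ggplot objects, so patchwork can
# place them side by side with `+`
(autoplot(pca_res) + ggtitle("PCA")) +
  (autoplot(pls_res) + ggtitle("PLS"))
```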

Then, without showing the LLM your figure or results, ask it: “I’m building a classification model with a recipe that uses step_pca for dimensionality reduction. Should I switch to step_pls? Why or why not?”

Evaluate the LLM’s answer:

  • Does it correctly identify that PCA is unsupervised (ignores the response) while PLS uses the response, so PLS components are more likely to be discriminative for classification?
  • Or does it default to vague “it depends” without engaging with the supervised/unsupervised asymmetry?

Compare with what you observed in your figure. Did the data confirm or contradict the LLM’s answer?

(2 points - discussion)