Module 5

In module 5, we learned about tuning model hyperparameters. In this homework, we use L1/L2 penalties and dimensionality reduction with PCA and PLS to control the flexibility of our models and reduce the risk of overfitting. We will apply these techniques to build classification and regression models. Both problems 1 and 2 use datasets from previous assignments, so you can reuse the preprocessing steps you developed there.

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-5.Rmd) and use it to answer the following questions.

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

As you will find out, the knitting times for the assignment will get longer as you add more code. To speed up the knitting process, use caching and parallel processing. You can find more information about caching here and about parallel processing here.
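
A minimal sketch of both speed-ups in an R Markdown setup chunk (the core count and cluster handling are choices you can adapt; tidymodels' tuning functions pick up a registered foreach backend automatically):

```r
# Enable knitr chunk caching globally; chunks are re-run only when their
# code (or declared dependencies) change.
knitr::opts_chunk$set(cache = TRUE)

# Register a parallel backend for cross-validation / tuning.
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)
# ... tuning code runs in parallel ...
# stopCluster(cl)  # release the workers at the end of the document
```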

1. Build an elastic net model for predicting airfare prices (L1/L2 regularization)

The Airfares.csv.gz dataset was already used in problem 1 of module 2, where we built a linear regression model to predict the price of an airline ticket (FARE). In this assignment, we will predict FARE with a linear regression model that combines L1 and L2 regularization, i.e., an elastic net (tuning parameters mixture and penalty).

Load the data from https://gedeck.github.io/DS-6030/datasets/homework/Airfares.csv.gz

(1.1) Load and preprocess the data. Reuse the preprocessing steps you developed in module 2. (1 point - coding)

(1.2) Split the data into a training (75%) and test set (25%). Prepare the resamples for 10-fold cross-validation using the training set. (1 point - coding)
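
A sketch of the split and resamples, assuming the preprocessed data is in a data frame called `airfares` (the seed is arbitrary):

```r
library(tidymodels)
set.seed(123)  # assumed seed for reproducibility

# 75/25 train/test split, stratified on the outcome FARE
data_split <- initial_split(airfares, prop = 0.75, strata = FARE)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 10-fold cross-validation resamples built from the training set only
cv_folds <- vfold_cv(train_data, v = 10, strata = FARE)
```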

(1.3) Define workflow and tuneable parameters (1 point - coding)
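
One way to set this up (`airfare_recipe` stands in for your module 2 preprocessing recipe):

```r
library(tidymodels)

# Elastic net: tune both the amount of regularization (penalty) and the
# L1/L2 balance (mixture); glmnet implements the combined penalty.
elasticnet_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
    set_engine("glmnet")

elasticnet_wf <- workflow() %>%
    add_recipe(airfare_recipe) %>%   # assumed name of your recipe
    add_model(elasticnet_spec)
```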

(1.4) Tune the model with 10-fold cross-validation using Bayesian hyperparameter optimization. Make sure that your search space covers a suitable range of values. (see DS-6030: Bayesian Hyperparameter optimization) (2 points - coding)
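
A hedged sketch of the tuning call, assuming a tunable elastic net workflow `elasticnet_wf` and CV resamples `cv_folds` from the previous steps (the penalty range and iteration counts are illustrative, not prescribed):

```r
library(tidymodels)

# Adjust the search space: penalty is searched on a log10 scale.
params <- extract_parameter_set_dials(elasticnet_wf) %>%
    update(penalty = penalty(range = c(-6, 0)))  # 10^-6 to 10^0

tune_results <- tune_bayes(
    elasticnet_wf,
    resamples  = cv_folds,
    param_info = params,
    initial    = 10,   # size of the initial space-filling design
    iter       = 25,   # Bayesian optimization iterations
    metrics    = metric_set(rmse),
    control    = control_bayes(no_improve = 10)
)
show_best(tune_results, metric = "rmse")
```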

(1.5) Train a final model using the best parameter set. (1 point - coding)

(1.6) Predict the FARE for the test set and calculate the performance metrics on the test set. (1 point - coding)
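
Steps (1.5) and (1.6) can be sketched together with `last_fit()`, which fits the finalized workflow on the full training set and evaluates it on the held-out test set (object names follow the assumptions above):

```r
library(tidymodels)

best_params <- select_best(tune_results, metric = "rmse")
final_wf    <- finalize_workflow(elasticnet_wf, best_params)

# Train on the training portion of the split, evaluate on the test portion
final_fit <- last_fit(final_wf, split = data_split,
                      metrics = metric_set(rmse, rsq, mae))
collect_metrics(final_fit)      # test-set performance
collect_predictions(final_fit)  # test-set FARE predictions
```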

2. NASA: Asteroid classification - classification with dimensionality reduction

The dataset nasa.csv contains information about asteroids and if they are considered to be hazardous or not.

  1. Data loading and preprocessing

(2.1) Load the data from https://gedeck.github.io/DS-6030/datasets/nasa.csv and preprocess the data. You can find the necessary preprocessing steps in module 3. (1 point - coding)

(2.2) Split the dataset into a training and test set. Use 80% of the data for training and 20% for testing. Use stratified sampling to ensure that the training and test set have the same proportion of hazardous asteroids. (1 point - coding)
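
A sketch of the stratified split, assuming the preprocessed data frame is `nasa` and the outcome column is `Hazardous` (check the actual column name after your preprocessing):

```r
library(tidymodels)
set.seed(123)  # assumed seed

# 80/20 split; stratifying on the outcome keeps the proportion of
# hazardous asteroids the same in both sets.
nasa_split <- initial_split(nasa, prop = 0.80, strata = Hazardous)
nasa_train <- training(nasa_split)
nasa_test  <- testing(nasa_split)

# Resamples for the cross-validation in the model-building steps
nasa_folds <- vfold_cv(nasa_train, v = 10, strata = Hazardous)
```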

  2. Model building

(2.3) Build a logistic regression classification model using principal components (step_pca) as predictors. Use cross-validation to determine the optimal number of components. (see DS-6030: Specifying tunable parameters) (3 points - coding)

  • Use step_normalize and step_pca to preprocess the data in a recipe.
  • Use the tune function to find the best number of components (num_comp) using AUC as the selection criterion. Check every possible value from 1 to 14.
  • Use the autoplot function on the cross-validation results to visualize the results. Describe your observations.
  • Report the best number of components and the associated classification metrics.
  • Using the best parameter set, train a final model using the full training set and determine the performance metrics on the test set.
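
The bullet points above can be sketched as follows (data and column names follow the assumptions from (2.2); the exhaustive 1-14 grid replaces the default random grid):

```r
library(tidymodels)

# Recipe: normalize the predictors, then replace them with a tunable
# number of principal components.
pca_recipe <- recipe(Hazardous ~ ., data = nasa_train) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pca(all_numeric_predictors(), num_comp = tune())

pca_wf <- workflow() %>%
    add_recipe(pca_recipe) %>%
    add_model(logistic_reg() %>% set_engine("glm"))

# Exhaustive grid over 1-14 components, selected by AUC
pca_tuned <- tune_grid(
    pca_wf,
    resamples = nasa_folds,
    grid      = tibble(num_comp = 1:14),
    metrics   = metric_set(roc_auc)
)
autoplot(pca_tuned)                       # visualize AUC vs. num_comp
show_best(pca_tuned, metric = "roc_auc")  # best number of components
```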

(2.4) Repeat (2.3) using PLS (step_pls) in the preprocessing of the predictors; the model remains a logistic regression classifier. (see DS-6030: Partial least squares regression on how to install the required packages) (3 points - coding)

  • Use step_normalize and step_pls to preprocess the data in a recipe.
  • Use the tune function to find the best number of components (num_comp) using AUC as the selection criterion. Check every possible value from 1 to 14.
  • Use the autoplot function on the cross-validation results to visualize the results. Describe your observations.
  • Report the best number of components and the associated classification metrics.
  • Using the best parameter set, train a final model using the full training set and determine the performance metrics on the test set.
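
The PLS variant only changes the dimensionality-reduction step; unlike PCA, PLS is supervised, so step_pls needs to know the outcome column (same assumed object names as before; step_pls requires the mixOmics package, as covered in the course notes):

```r
library(tidymodels)

pls_recipe <- recipe(Hazardous ~ ., data = nasa_train) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_pls(all_numeric_predictors(),
             outcome = "Hazardous",   # PLS components use the outcome
             num_comp = tune())

pls_wf <- workflow() %>%
    add_recipe(pls_recipe) %>%
    add_model(logistic_reg() %>% set_engine("glm"))

pls_tuned <- tune_grid(
    pls_wf,
    resamples = nasa_folds,
    grid      = tibble(num_comp = 1:14),
    metrics   = metric_set(roc_auc)
)
autoplot(pls_tuned)
show_best(pls_tuned, metric = "roc_auc")
```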

(2.5) Compare the tuning results in (2.3) and (2.4) and comment on the differences. Do you see different behavior in the autoplot graphs? What do you think is going on? Could you reduce the number of components further from what is suggested by CV? (2 points - discussion)