Module 6

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-6.Rmd) and use it to answer the following questions.

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

1. Predict out-of-state tuition (feature selection)

The data College.csv contains a number of variables for 777 different universities and colleges in the US. In this exercise, we will try to predict the Outstate tuition fee using the other variables in the data set.

Data loading and preprocessing

(1.1) Load the data from ISLR2::College and split into training and holdout sets using a 80/20 split. (1 point - coding)
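
As a starting point, the split in (1.1) could be sketched as follows. The seed value and the object names (college_split, college_train, college_holdout) are assumptions chosen for illustration, not part of the assignment.

```r
library(tidymodels)

# Assumed seed; any fixed value makes the split reproducible
set.seed(1)

# 80/20 split of the College data into training and holdout sets
college_split <- initial_split(ISLR2::College, prop = 0.8)
college_train <- training(college_split)
college_holdout <- testing(college_split)
```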

Train linear regression and Lasso models

Use tidymodels to define a workflow and build a linear regression model to predict Outstate from all the other variables using L1 regularization (Lasso).

(1.2) For preprocessing, normalize all the numerical variables (step_normalize(all_numeric_predictors())) and convert the categorical / nominal variables to dummy variables (step_dummy(all_nominal_predictors())). (1 point - coding)

(1.3) Train a normal linear regression model using the lm engine with the training set. Look at the \(p\)-values of the individual features. Which features are significant? Use extract_fit_engine and summary. (1 point - coding/discussion)
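
One possible way to combine the preprocessing from (1.2) with the linear model from (1.3) is sketched below. The seed and the object names (college_recipe, lm_fit) are illustrative assumptions; the training set is rebuilt here only so that the snippet runs on its own.

```r
library(tidymodels)

set.seed(1)  # assumed seed
college_train <- training(initial_split(ISLR2::College, prop = 0.8))

# Preprocessing as described in (1.2)
college_recipe <- recipe(Outstate ~ ., data = college_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Ordinary linear regression with the lm engine
lm_fit <- workflow() %>%
  add_recipe(college_recipe) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = college_train)

# The summary of the underlying lm object lists the p-values
lm_fit %>% extract_fit_engine() %>% summary()
```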

(1.4) Use glmnet and tune the L1 penalty parameter using 10-fold cross-validation. Tune the penalty over the range penalty(c(-1, 2.5)) and check that the range is appropriate using autoplot. (2 points - coding)

(1.5) Determine the best penalty parameter using the lowest \(RMSE\) and the penalty obtained from the one-standard-error rule (select_by_one_std_err, see online material). For both penalties, finalize the workflow and train models using the full training set. Report the coefficients of each model. Which variables are selected in each case? (2 points - coding/discussion)
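
The tuning and selection steps in (1.4) and (1.5) might look roughly like the sketch below. The seed, the number of grid levels (30), the helper fit_final, and all object names are assumptions made for illustration. Sorting by desc(penalty) in select_by_one_std_err is one way to prefer the most strongly regularized model within one standard error.

```r
library(tidymodels)

set.seed(1)  # assumed seed
college_train <- training(initial_split(ISLR2::College, prop = 0.8))

college_recipe <- recipe(Outstate ~ ., data = college_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# mixture = 1 gives a pure L1 (Lasso) penalty in glmnet
lasso_wf <- workflow() %>%
  add_recipe(college_recipe) %>%
  add_model(linear_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet"))

# 10-fold cross-validation over a regular grid on the log10 penalty scale
lasso_tuned <- tune_grid(
  lasso_wf,
  resamples = vfold_cv(college_train, v = 10),
  grid = grid_regular(penalty(c(-1, 2.5)), levels = 30),  # 30 levels is an assumption
  metrics = metric_set(rmse)
)
autoplot(lasso_tuned)

# Lowest RMSE versus the one-standard-error rule
best_penalty <- select_best(lasso_tuned, metric = "rmse")
ose_penalty <- select_by_one_std_err(lasso_tuned, desc(penalty), metric = "rmse")

# Hypothetical helper: finalize the workflow and refit on the full training set
fit_final <- function(params) {
  finalize_workflow(lasso_wf, params) %>% fit(data = college_train)
}
best_fit <- fit_final(best_penalty)
ose_fit <- fit_final(ose_penalty)

# Non-zero coefficients are the variables kept by each model
best_fit %>% extract_fit_parsnip() %>% tidy() %>% filter(estimate != 0)
ose_fit %>% extract_fit_parsnip() %>% tidy() %>% filter(estimate != 0)
```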

(1.6) Use the model from (1.3) and the two models from (1.5) to predict the Outstate variable on the training and test set. Report the RMSE and \(R^2\) of each model on the training and test set. (1 point - coding/discussion)
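
For the reporting in (1.6), a small helper along the following lines could be reused for every fitted model. The function name evaluate, the metric set name reg_metrics, and the seed are assumptions. A plain lm workflow without the recipe stands in for the fitted models so that the snippet runs on its own.

```r
library(tidymodels)

set.seed(1)  # assumed seed
college_split <- initial_split(ISLR2::College, prop = 0.8)
college_train <- training(college_split)
college_holdout <- testing(college_split)

# Stand-in model; the same helper works for any fitted workflow
lm_fit <- workflow() %>%
  add_formula(Outstate ~ .) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = college_train)

reg_metrics <- metric_set(rmse, rsq)

# Hypothetical helper: RMSE and R^2 of a fitted model on a given data set
evaluate <- function(fitted_wf, data, label) {
  augment(fitted_wf, new_data = data) %>%
    reg_metrics(truth = Outstate, estimate = .pred) %>%
    mutate(dataset = label)
}

results <- bind_rows(
  evaluate(lm_fit, college_train, "train"),
  evaluate(lm_fit, college_holdout, "holdout")
)
results
```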

2. Predict out-of-state tuition (GAM model)

GAM model

Using the significant features from (1.3), build a generalized additive model (GAM) to predict Outstate (see "Generalized additive models (GAM)" for how to build GAM models in tidymodels).

(2.1) Define a model formula setting all numerical variables as splines and all categorical variables as factors (Outstate ~ Private + s(Apps) + ...). (1 point - coding)

(2.2) Define the gen_additive_mod model using mgcv as the engine and fit the model using the training data. (2 points - coding)

(2.3) Report the RMSE and \(R^2\) of the model on the training and test set and compare with the result from (1.6) (1 point - discussion)

(2.4) Use the plot function with the fitted model (use extract_fit_engine to get the actual mgcv model for plotting; set scale=0 to get individual \(y\)-scales). Describe your observations. (1 point - coding/discussion)

(2.5) Use the summary function to get information about the model. Based on the reported significance levels, could you simplify the model further? (1 point - coding/discussion)
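
The GAM steps (2.1) to (2.5) could be sketched as below. The predictors in the formula (Private, Apps, Room.Board, Expend) are placeholders only and should be replaced by the features found significant in (1.3). The seed, the object names, and the pages = 1 argument are further assumptions.

```r
library(tidymodels)

set.seed(1)  # assumed seed
college_train <- training(initial_split(ISLR2::College, prop = 0.8))

# Placeholder formula: numeric predictors as splines, Private as a factor.
# Replace the terms with the significant features from (1.3).
gam_formula <- Outstate ~ Private + s(Apps) + s(Room.Board) + s(Expend)

gam_fit <- gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression") %>%
  fit(gam_formula, data = college_train)

# The underlying mgcv object is needed for plotting and the summary
gam_engine <- extract_fit_engine(gam_fit)
plot(gam_engine, scale = 0, pages = 1)  # scale = 0: individual y-scales
summary(gam_engine)
```

Predictions for the training and test sets can then be obtained with augment(gam_fit, new_data = ...) in the same way as for the other models.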

(2.6) Simplify the model by removing the non-significant variables and re-fit the model. Report the RMSE and \(R^2\) of the model on the training and test set. (1 point - coding/discussion)

Comparison

(2.7) Compare the results of the three models from (1.6), the full GAM model from (2.2), and the reduced model from (2.6). (2 points - discussion)