DS-6030 Homework Module 6
You can download the Quarto Markdown file and use it to answer the following questions.
If not otherwise stated, use the tidyverse and tidymodels packages for the assignments.
You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:
- Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
- Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
- Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.
The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.
1. Predict out-of-state tuition with feature selection
The College data set (available as ISLR2::College) contains a number of variables for 777 different universities and colleges in the US. In this exercise, we will try to predict the out-of-state tuition (Outstate) using the other variables in the data set.
- Data loading and preprocessing:
(1.1) Load the data from ISLR2::College and split into training and holdout sets using an 80/20 split. (1 point - coding)
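A minimal sketch of the loading and splitting step (the seed value is an arbitrary choice; any fixed seed is fine):

```r
library(tidymodels)

set.seed(6030)  # arbitrary seed, chosen only for reproducibility

college <- ISLR2::College

# 80/20 train/holdout split
college_split <- initial_split(college, prop = 0.8)
college_train <- training(college_split)
college_test  <- testing(college_split)
```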
- Train linear regression and Lasso models:
Use tidymodels to define a workflow and build a linear regression model to predict Outstate from all the other variables using L1 regularization (Lasso).
(1.2) For preprocessing, normalize all the numerical variables (step_normalize(all_numeric_predictors())) and convert the categorical / nominal variables to dummy variables (step_dummy(all_nominal_predictors())). (1 point - coding)
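The preprocessing recipe might look like this (assuming the training set is named college_train; the name is a placeholder):

```r
# Normalize numeric predictors, dummy-code nominal predictors
college_rec <- recipe(Outstate ~ ., data = college_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
```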
(1.3) Train a normal linear regression model using the lm engine with the training set. Look at the \(p\)-values of the individual features. Which features are significant? (use extract_fit_engine and summary) (1 point - coding/discussion)
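One way to fit and inspect the plain lm model (the object names college_rec and college_train are assumed placeholders from the previous steps):

```r
lm_wf <- workflow() |>
  add_recipe(college_rec) |>                     # recipe from (1.2); name assumed
  add_model(linear_reg() |> set_engine("lm"))

lm_fit <- fit(lm_wf, data = college_train)

# summary() on the extracted lm object shows the coefficient p-values
lm_fit |> extract_fit_engine() |> summary()
```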
(1.4) Use glmnet and tune the L1 penalty parameter using 10-fold cross-validation. Tune the penalty over the range penalty(range = c(-1, 2.5)) (note that penalty() specifies its range on the log10 scale) and check that the range is appropriate using autoplot. (2 points - coding)
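A sketch of the tuning setup (object names follow the earlier sketches and are placeholders; the grid size and seed are arbitrary choices):

```r
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>  # mixture = 1 is the Lasso
  set_engine("glmnet")

lasso_wf <- workflow() |>
  add_recipe(college_rec) |>   # recipe from (1.2); name assumed
  add_model(lasso_spec)

set.seed(6030)  # arbitrary seed for the resampling
folds <- vfold_cv(college_train, v = 10)

# penalty() works on the log10 scale, so c(-1, 2.5) covers 10^-1 to 10^2.5
lasso_grid <- grid_regular(penalty(range = c(-1, 2.5)), levels = 50)

lasso_res <- tune_grid(lasso_wf, resamples = folds, grid = lasso_grid)

autoplot(lasso_res)  # the RMSE minimum should lie inside the tuned range
```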
(1.5) Determine the best penalty parameters using the lowest \(RMSE\) and the penalty obtained from the one-standard-error rule (select_by_one_std_err, see online material). For both penalties, finalize the workflow and train models using the full training set. Report the coefficients of each model. Which variables are selected in each case? (2 points - coding/discussion)
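The two penalty choices and the finalized fits might be sketched as follows (lasso_res, lasso_wf, and college_train are placeholder names for the objects built in (1.4)):

```r
# Penalty with the lowest cross-validated RMSE
best_rmse <- select_best(lasso_res, metric = "rmse")

# One-standard-error rule: largest penalty within one SE of the best RMSE
best_1se <- select_by_one_std_err(lasso_res, desc(penalty), metric = "rmse")

lasso_fit_rmse <- finalize_workflow(lasso_wf, best_rmse) |> fit(college_train)
lasso_fit_1se  <- finalize_workflow(lasso_wf, best_1se)  |> fit(college_train)

# Coefficients shrunk exactly to zero are the variables dropped by the Lasso
tidy(lasso_fit_rmse)
tidy(lasso_fit_1se)
```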
(1.6) Use the model from (1.3) and the two models from (1.5) to predict the Outstate variable on the training and test set. Report the RMSE and \(R^2\) of each model on the training and test set. (1 point - coding/discussion)
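Assuming fitted workflows named lm_fit, lasso_fit_rmse, and lasso_fit_1se (placeholder names from the sketches above), the metrics could be computed like this; eval_fit is a hypothetical helper, not a tidymodels function:

```r
reg_metrics <- metric_set(rmse, rsq)

# Hypothetical helper: predictions plus metrics for one fit on one data set
eval_fit <- function(fit, data) {
  augment(fit, new_data = data) |>
    reg_metrics(truth = Outstate, estimate = .pred)
}

eval_fit(lm_fit, college_train)
eval_fit(lm_fit, college_test)
# ... repeat for lasso_fit_rmse and lasso_fit_1se
```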
2. Predict out-of-state tuition using a GAM model
- GAM model:
Using the significant features from (1.3), build a generalized additive model (GAM) to predict Outstate. (see Generalized additive models (GAM) for how to build GAM models in tidymodels)
(2.1) Define a model formula setting all numerical variables as splines and all categorical variables as factors (wrap numerical predictors with s(), e.g., Outstate ~ Private + s(Apps) + ...). (1 point - coding)
(2.2) Define the gen_additive_mod model using mgcv as the engine and fit the model using the training data. (2 points - coding)
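A sketch of the GAM specification and fit. The predictors in the formula are placeholders (substitute the features that were actually significant in your (1.3) model):

```r
# Placeholder formula: numeric predictors wrapped in s(), factors entered as-is
gam_formula <- Outstate ~ Private + s(Room.Board) + s(Expend) + s(Grad.Rate)

gam_spec <- gen_additive_mod() |>
  set_engine("mgcv") |>
  set_mode("regression")

# The smooth terms are passed in through the fit formula
gam_fit <- gam_spec |> fit(gam_formula, data = college_train)
```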
(2.3) Report the RMSE and \(R^2\) of the model on the training and test set and compare with the results from (1.6). (1 point - discussion)
(2.4) Use the plot function with the fitted model (use extract_fit_engine to get the actual mgcv model for plotting; set scale=0 to get individual \(y\)-scales). Describe your observations. (1 point - coding/discussion)
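For example (gam_fit is a placeholder name for the fitted model from (2.2)):

```r
# plot.gam from mgcv; scale = 0 gives each smooth its own y-axis,
# pages = 1 puts all panels on one page
gam_fit |> extract_fit_engine() |> plot(scale = 0, pages = 1)
```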
(2.5) Use the summary function to get information about the model. Based on the reported significance levels, could you simplify the model further? (1 point - coding/discussion)
(2.6) Simplify the model by removing the non-significant variables and re-fit the model. Report the RMSE and \(R^2\) of the model on the training and test set. (1 point - coding/discussion)
- Comparison:
(2.7) Compare the results from the three models from (1) (one from (1.3) and two from (1.5)), the full GAM model from (2.2) and the reduced model from (2.6). (2 points - discussion)