DS-6030 Homework Module 6
You can download the Quarto Markdown file and use it to answer the following questions.
If not otherwise stated, use Tidyverse and Tidymodels for the assignments.
You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:
- Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
- Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
- Verify the output: LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.
The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.
1. Predict out-of-state tuition with feature selection (8 points)
The College data (College.csv, available in R as ISLR2::College) contains a number of variables for 777 different universities and colleges in the US. In this exercise, we will try to predict the out-of-state tuition fee (Outstate) using the other variables in the data set.
- Data loading and preprocessing:
(1.1) Load the data from ISLR2::College and split it into training and holdout (test) sets using an 80/20 split. Call set.seed() before splitting so that the split is reproducible. (1 point - coding)
- Train linear regression and Lasso models:
Use tidymodels to define a workflow and build linear regression models to predict Outstate from all the other variables, both without and with L1 regularization (Lasso).
(1.2) For preprocessing, normalize all the numerical variables (step_normalize(all_numeric_predictors())) and convert the categorical / nominal variables to dummy variables (step_dummy(all_nominal_predictors())). Use the same recipe in both (1.3) and (1.4). (1 point - coding)
(1.3) Train an ordinary linear regression model using the lm engine on the training set. Look at the \(p\)-values of the individual features (use extract_fit_engine and summary). Which features are significant? (1 point - coding/discussion)
(1.4) Use glmnet and tune the L1 penalty parameter using 10-fold cross-validation. Tune the penalty over the range penalty(range = c(-1, 2.5)) and check that the range is appropriate using autoplot. (2 points - coding)
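Note on the penalty range in (1.4): in dials, penalty() is defined on a log10 scale, so range = c(-1, 2.5) corresponds to penalty values from 0.1 to roughly 316. The minimal sketch below only illustrates that mapping. The object names pen and pen_grid and the choice of 8 levels are illustrative assumptions, not part of the assignment.

```r
library(dials)

# Penalty parameter; the range is given in log10 units (assumed names: pen, pen_grid)
pen <- penalty(range = c(-1, 2.5))

# A regular grid with 8 levels (arbitrary choice), returned on the original scale
pen_grid <- grid_regular(pen, levels = 8)

# Smallest and largest candidate penalties: 10^-1 and 10^2.5
range(pen_grid$penalty)
```

A grid like this can be passed to tune_grid through its grid argument. How many levels you use is up to you.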
(1.5) Determine two penalty values: the one with the lowest \(RMSE\) and the one obtained from the one-standard-error rule (select_by_one_std_err, see online material). For each penalty, finalize the workflow and train a model using the full training set. Report the coefficients of each model. Which variables are selected in each case? (2 points - coding/discussion)
(1.6) Use the model from (1.3) and the two models from (1.5) to predict the Outstate variable on the training and test sets. Report the RMSE and \(R^2\) of each model on both sets. (1 point - coding/discussion)
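Note on the metrics in (1.6): one possible way to compute RMSE and \(R^2\) in a single call is a yardstick metric_set. The toy numbers below are made up for illustration and are unrelated to the College data. Be aware that yardstick's rsq() is the squared correlation between truth and estimate, while rsq_trad() is the traditional 1 - SS_res/SS_tot. The two can differ substantially when predictions are biased, as in this deliberately extreme sketch.

```r
library(yardstick)
library(tibble)

# Made-up toy data: predictions are perfectly correlated with the truth but twice as large
toy <- tibble(
  truth    = c(1, 2, 3, 4),
  estimate = c(2, 4, 6, 8)
)

# Bundle several regression metrics into one function (assumed name: reg_metrics)
reg_metrics <- metric_set(rmse, rsq, rsq_trad)

toy_metrics <- reg_metrics(toy, truth = truth, estimate = estimate)
toy_metrics
# rmse is sqrt(7.5), about 2.74; rsq is 1 (perfect correlation); rsq_trad is -5
```

Whichever variant you report, use the same one for every model so that the comparison is consistent.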
2. Predict out-of-state tuition using a GAM model (11 points)
- GAM model:
Using the significant features from (1.3), build a generalized additive model (GAM) to predict Outstate. (see Generalized additive models (GAM) for how to build GAM models in tidymodels)
(2.1) Define a model formula setting all numerical variables as splines and all categorical variables as factors (wrap numerical predictors with s(), e.g., Outstate ~ Private + s(Apps) + ...). (1 point - coding)
(2.2) Define the gen_additive_mod model using mgcv as the engine and fit the model using the training data. (2 points - coding)
(2.3) Report the RMSE and \(R^2\) of the model on the training and test sets and compare with the results from (1.6). (1 point - discussion)
(2.4) Use the plot function with the fitted model (use extract_fit_engine to get the actual mgcv model for plotting; set scale=0 to get individual \(y\)-scales). Describe your observations: which terms look approximately linear vs. clearly non-linear, and how wide are the confidence bands relative to the partial effect? (1 point - coding/discussion)
(2.5) Use the summary function to get information about the model. Based on the reported significance levels, could you simplify the model further? (1 point - coding/discussion)
(2.6) Paste the smooth-term table from the summary() output in (2.5) (the rows with edf and the smoothness p-values) into an LLM and ask it to interpret three smooth terms: the term with the lowest edf in your fit, the term with the highest edf, and one term with edf close to 2.
Evaluate the LLM’s interpretations:
- Does it correctly read edf near 1 as “effectively linear” and suggest replacing s(x) with the raw x?
- Does it distinguish the smoothness p-value (does the fitted curve differ from a flat line?) from a generic “is this predictor important” p-value?
- Does it slip into causal language when describing the partial-effect plots from (2.4) (e.g. “increasing X causes Y to rise”), or does it stay correlational?
Rewrite any sloppy or causal interpretation in your own words.
(2 points - discussion)
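For orientation in (2.4) to (2.6), the sketch below fits a GAM to simulated data (not the College data) and shows where the smooth-term table sits in the mgcv summary. The data, the variable names x_wiggly and x_linear, and all object names are assumptions for illustration only.

```r
library(tidymodels)  # the mgcv package must also be installed

# Simulated data: y depends non-linearly on x_wiggly and linearly on x_linear
set.seed(6030)
n <- 300
toy <- tibble(
  x_wiggly = runif(n),
  x_linear = runif(n),
  y = sin(2 * pi * x_wiggly) + 2 * x_linear + rnorm(n, sd = 0.3)
)

# GAM specification with mgcv as the engine
gam_spec <- gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression")

# Smooth terms are declared with s() in the formula passed to fit()
gam_fit <- gam_spec %>%
  fit(y ~ s(x_wiggly) + s(x_linear), data = toy)

# Extract the underlying mgcv object for summary() and plot()
gam_model <- extract_fit_engine(gam_fit)

# Smooth-term table: one row per s() term, with columns edf, Ref.df, F, p-value
smooth_table <- summary(gam_model)$s.table
smooth_table

# Partial-effect plots, each panel with its own y-scale
plot(gam_model, scale = 0, pages = 1)
```

In a setup like this, one would expect s(x_linear) to have an edf close to 1 and s(x_wiggly) a clearly larger edf, but check the printed table rather than relying on that expectation.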
(2.7) Simplify the model by removing the non-significant variables and re-fit the model. Report the RMSE and \(R^2\) of the model on the training and test sets. (1 point - coding/discussion)
- Comparison:
(2.8) Compare the results from the linear regression model from (1.3), the two Lasso models from (1.5), the full GAM model from (2.2), and the reduced GAM model from (2.7). (2 points - discussion)