DS-6030 Homework Module 7

Note

You can download the Quarto Markdown file and use it to answer the following questions.

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Predicting Prices of Used Cars (Regression Trees) (17 points)

The dataset contains data on used Toyota Corollas offered for sale in the Netherlands during late summer 2004. It has 1436 records with 38 variables, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla from its specifications.

Load the Toyota Corolla dataset.

  1. Load and preprocess the data:

(1.1) Load and preprocess the data. Convert all relevant variables to factors. (2 points - coding)

(1.2) Split the data into training (60%) and test (40%) datasets. (1 point - coding)
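As a rough illustration of the factor conversion and the 60/40 split, the sketch below uses a small simulated data frame rather than the real file. The column values, the 0/1 coding of Automatic, the seed, and the object names (`cars`, `car_split`, `car_train`, `car_test`) are assumptions made for the example.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the Toyota Corolla data (NOT the real file)
cars <- tibble(
  Price     = round(runif(200, 4000, 30000)),
  Age_08_04 = sample(1:80, 200, replace = TRUE),
  Fuel_Type = sample(c("Petrol", "Diesel", "CNG"), 200, replace = TRUE),
  Automatic = sample(0:1, 200, replace = TRUE)  # assumed 0/1 coding
)

# Convert categorical columns to factors
cars <- cars %>%
  mutate(
    Fuel_Type = factor(Fuel_Type),
    Automatic = factor(Automatic, levels = c(0, 1), labels = c("No", "Yes"))
  )

# 60% training, 40% test
car_split <- initial_split(cars, prop = 0.6)
car_train <- training(car_split)
car_test  <- testing(car_split)
```

Which columns count as "relevant" for conversion depends on the actual data, so check each variable's meaning before turning it into a factor.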

  2. Large tree:

In (1.3) below, define and fit the workflow for a model to predict the outcome variable Price using the following predictors: Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Use the following settings depending on the model engine (rpart or partykit) you choose:

  • rpart engine: Set the minimum number of records a node must contain before a split is attempted to 2 (min_n = 2), the maximum tree depth to 30 (tree_depth = 30), and the cost-complexity parameter to 0.0005 (cost_complexity = 0.0005). These settings place few restrictions on the fit and result in a large tree.

  • partykit engine: Set the minimum number of records a node must contain before a split is attempted to 2 (min_n = 2) and the maximum tree depth to 30 (tree_depth = 30). These settings place few restrictions on the fit and result in a large tree.

(1.3) Fit a model using the full training dataset. Plot or print the resulting tree and identify the three most important car specifications for predicting the car’s price based on the variables used in the top splits (root and the two next levels). (1 point - coding/discussion)
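The rpart settings described above could be wired into a workflow roughly as follows. This is a sketch on simulated data with only three of the listed predictors; the data, seed, and object names (`toy`, `large_spec`, `large_wf`, `large_fit`, `large_tree`) are assumptions for illustration.

```r
library(tidymodels)

set.seed(1)
n <- 300

# Simulated stand-in data (NOT the real dataset)
toy <- tibble(
  Age_08_04 = sample(1:80, n, replace = TRUE),
  KM        = round(runif(n, 1000, 200000)),
  HP        = sample(c(69, 86, 97, 110), n, replace = TRUE)
) %>%
  mutate(Price = 25000 - 170 * Age_08_04 - 0.02 * KM + 20 * HP +
           rnorm(n, sd = 800))

# Model specification with the "large tree" settings for rpart
large_spec <- decision_tree(
  mode = "regression",
  min_n = 2,
  tree_depth = 30,
  cost_complexity = 0.0005
) %>%
  set_engine("rpart")

# Workflow = preprocessing (here a formula) + model
large_wf <- workflow() %>%
  add_formula(Price ~ Age_08_04 + KM + HP) %>%
  add_model(large_spec)

large_fit <- fit(large_wf, data = toy)

# Pull out the underlying rpart object and print the splits
large_tree <- extract_fit_engine(large_fit)
print(large_tree)
```

For the partykit engine, the same pattern should apply with `set_engine("partykit")` and without `cost_complexity`; that engine is registered by the bonsai package, which would need to be loaded first.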

(1.4) Submit your fitted tree from (1.3) to an LLM (paste the printed tree or list the top three splits) and ask: “Are the variables at the top of this tree the most important predictors of car price?”

Evaluate the LLM’s answer:

  • Does it engage with how trees actually choose splits — greedily and locally, so a globally informative variable can be masked by a correlated predictor, and predictors with more candidate split points (continuous, high-cardinality factors) win more often than binary indicators regardless of true importance?
  • Or does it default to “top splits = most important features”?

Point to one place in your fitted tree that supports or contradicts the LLM’s answer (e.g. a binary indicator that never splits, or two correlated continuous predictors where only one is used).
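The point about candidate split points can be checked with a small simulation. In the sketch below, the outcome is pure noise, so neither predictor is truly informative; the variable names, sample size, and number of repetitions are arbitrary choices for the example. It counts how often each predictor is chosen for the root split of a depth-1 tree.

```r
library(rpart)

set.seed(42)
n_sims   <- 200
root_var <- character(n_sims)

for (i in seq_len(n_sims)) {
  d <- data.frame(
    y      = rnorm(100),  # outcome unrelated to both predictors
    x_cont = rnorm(100),  # continuous: many candidate split points
    x_bin  = factor(sample(c("No", "Yes"), 100, replace = TRUE))  # one candidate split
  )
  # Depth-1 tree; cp = 0 so the best available split is always kept
  stump <- rpart(y ~ x_cont + x_bin, data = d,
                 control = rpart.control(maxdepth = 1, cp = 0))
  root_var[i] <- as.character(stump$frame$var[1])
}

# How often each variable was selected at the root
table(root_var)
```

If the continuous predictor is selected far more often even though both are noise, that illustrates why appearing in a top split is not by itself evidence of importance.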

(2 points - discussion)

(1.5) Determine the prediction error on the training and test sets using the root mean squared error (RMSE). How does the predictive performance on the test set compare to that on the training set? Why does this occur? (1 point - coding/discussion)
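For reference, a minimal sketch of the RMSE computation with yardstick, checked against the formula written out by hand. The numbers are made up for the example.

```r
library(yardstick)

# Made-up true prices and predictions
truth    <- c(10000, 12000, 9000)
estimate <- c(11000, 12000, 7000)

# yardstick's vector interface
rmse_yardstick <- rmse_vec(truth = truth, estimate = estimate)

# The same quantity from the definition: sqrt of the mean squared error
rmse_manual <- sqrt(mean((truth - estimate)^2))

rmse_yardstick
rmse_manual
```

With predictions stored in a data frame (for example a `.pred` column added next to `Price`), the data-frame form `rmse(df, truth = Price, estimate = .pred)` computes the same metric.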

(1.6) How might we achieve better test predictive performance at the expense of training performance? (1 point - discussion)

  3. Smaller tree:

(1.7) Create a smaller tree using the settings listed below. How does its predictive performance on the test set compare to that of the deeper tree? Comment on whether the smaller tree underfits. (3 points - coding/discussion)

  • rpart engine: min_n=2, tree_depth=3, cost_complexity=0.001

  • partykit engine: min_n=2, tree_depth=3

  4. Tuned tree:

(1.8) Now define a workflow that tunes the decision tree. Define a suitable range for the tuning parameter and use a tuning strategy of your choice. Use autoplot on the tuning results to check the range — if the best value sits at an endpoint, extend the range and re-tune until the optimum is interior. (2 points - coding)

  • rpart engine: tune cost_complexity

  • partykit engine: tune tree_depth
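One possible shape for the tuning step with the rpart engine is sketched below, using 5-fold cross-validation and a regular grid. The simulated data, the fold count, the grid range and size, and the object names are all assumptions for the example; other tree parameters are left at their defaults here.

```r
library(tidymodels)

set.seed(2)
n <- 300

# Simulated stand-in data (NOT the real dataset)
toy <- tibble(
  Age_08_04 = sample(1:80, n, replace = TRUE),
  KM        = round(runif(n, 1000, 200000)),
  HP        = sample(c(69, 86, 97, 110), n, replace = TRUE)
) %>%
  mutate(Price = 25000 - 170 * Age_08_04 - 0.02 * KM + 20 * HP +
           rnorm(n, sd = 800))

# Mark cost_complexity as the parameter to tune
tune_spec <- decision_tree(mode = "regression", cost_complexity = tune()) %>%
  set_engine("rpart")

tune_wf <- workflow() %>%
  add_formula(Price ~ Age_08_04 + KM + HP) %>%
  add_model(tune_spec)

folds <- vfold_cv(toy, v = 5)

# The range is given on the log10 scale: 1e-5 to 1e-1
grid <- grid_regular(cost_complexity(range = c(-5, -1)), levels = 9)

tune_res <- tune_grid(
  tune_wf,
  resamples = folds,
  grid      = grid,
  metrics   = metric_set(rmse)
)

autoplot(tune_res)  # inspect whether the optimum lies inside the range
best <- select_best(tune_res, metric = "rmse")
best
```

If `best` lands on the smallest or largest grid value, widening `range` and re-running is the natural next step.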

(1.9) What is the best value of your tuning parameter? What is the predictive performance of the resulting model on the test set? (1 point - discussion)

(1.10) How does the predictive performance of the tuned model compare to the models from (1.3) and (1.7)? What do you observe? (1 point - discussion)

(1.11) Train a final model for the optimal tuning parameters and visualize the resulting tree. (1 point - coding/discussion)
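Once a best parameter value is known, the workflow can be finalized and refit. The sketch below plugs in a hypothetical value (`cost_complexity = 0.001`) instead of a real tuning result, and again uses simulated data; all object names are assumptions.

```r
library(tidymodels)

set.seed(4)
n <- 200

# Simulated stand-in data (NOT the real dataset)
toy <- tibble(
  Age_08_04 = sample(1:80, n, replace = TRUE),
  KM        = round(runif(n, 1000, 200000))
) %>%
  mutate(Price = 25000 - 170 * Age_08_04 - 0.02 * KM + rnorm(n, sd = 800))

tune_wf <- workflow() %>%
  add_formula(Price ~ Age_08_04 + KM) %>%
  add_model(
    decision_tree(mode = "regression", cost_complexity = tune()) %>%
      set_engine("rpart")
  )

# Hypothetical "best" value; in practice this comes from select_best()
best_params <- tibble(cost_complexity = 0.001)

# Replace tune() with the chosen value, then fit on the training data
final_wf   <- finalize_workflow(tune_wf, best_params)
final_fit  <- fit(final_wf, data = toy)
final_tree <- extract_fit_engine(final_fit)

print(final_tree)
# Optional plot, assuming the rpart.plot package is installed:
# rpart.plot::rpart.plot(final_tree, roundint = FALSE)
```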

  5. Predicting the price of a car:

(1.12) Given the various models, what is the predicted price for a car with the following characteristics (make sure to handle the categorical variables correctly):

Age_08_04=77, KM=117000, Fuel_Type=Petrol, HP=110, Automatic=No, Doors=5, Quarterly_Tax=100, Mfr_Guarantee=No, Guarantee_Period=3, Airco=Yes, Automatic_airco=No, CD_Player=No, Powered_Windows=No, Sport_Model=No, Tow_Bar=Yes

Report all three predictions (large tree, small tree, tuned tree) and discuss differences. (1 point - coding/discussion)
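When building the single-row data frame for prediction, factor columns need the same levels as in the training data. A minimal sketch of that idea follows, on simulated data with only three predictors; the level names, the data, and the object names are assumptions and must be matched to your own preprocessing.

```r
library(tidymodels)

set.seed(3)
n <- 200

# Simulated stand-in training data (NOT the real dataset)
train <- tibble(
  Age_08_04 = sample(1:80, n, replace = TRUE),
  Fuel_Type = factor(sample(c("CNG", "Diesel", "Petrol"), n, replace = TRUE)),
  Airco     = factor(sample(c("No", "Yes"), n, replace = TRUE))
) %>%
  mutate(Price = 20000 - 150 * Age_08_04 + 1000 * (Airco == "Yes") +
           rnorm(n, sd = 500))

fit_wf <- workflow() %>%
  add_formula(Price ~ Age_08_04 + Fuel_Type + Airco) %>%
  add_model(decision_tree(mode = "regression") %>% set_engine("rpart")) %>%
  fit(data = train)

# New observation: reuse the training levels so the factors line up
new_car <- tibble(
  Age_08_04 = 77L,
  Fuel_Type = factor("Petrol", levels = levels(train$Fuel_Type)),
  Airco     = factor("Yes",    levels = levels(train$Airco))
)

pred <- predict(fit_wf, new_data = new_car)
pred
```

The same `new_car` row can be passed to each fitted workflow in turn, which makes the three predictions directly comparable.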