DS-6030 Homework Module 7
You can download the Quarto Markdown file and use it to answer the following questions.
If not otherwise stated, use Tidyverse and Tidymodels for the assignments.
You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:
- Disclose which LLM you used and roughly what you used it for (e.g., concept clarification, code generation, prose review).
- Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
- Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.
The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.
1. Predicting Prices of Used Cars (Regression Trees)
The dataset contains data on used Toyota Corolla cars offered for sale during late summer 2004 in the Netherlands. It has 1436 records with 38 variables, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.
Load the Toyota Corolla dataset.
- Load and preprocess the data:
(1.1) Load and preprocess the data. Convert all relevant variables to factors. (2 points - coding)
(1.2) Split the data into training (60%) and test (40%) datasets. (1 point - coding)
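A minimal sketch of (1.1) and (1.2), assuming the data is a CSV named `ToyotaCorolla.csv` in the working directory and that the listed binary/categorical columns are the ones to convert; adjust the path, column names, and seed to your setup:

```r
library(tidyverse)
library(tidymodels)

# Load the data; the file name is an assumption — use your actual file.
toyota <- read_csv("ToyotaCorolla.csv") |>
  # Convert categorical specifications to factors (assumed column set).
  mutate(across(c(Fuel_Type, Automatic, Mfr_Guarantee, Airco, Automatic_airco,
                  CD_Player, Powered_Windows, Sport_Model, Tow_Bar),
                as.factor))

# 60/40 train/test split; seed chosen arbitrarily for reproducibility.
set.seed(1)
toyota_split <- initial_split(toyota, prop = 0.6)
toyota_train <- training(toyota_split)
toyota_test  <- testing(toyota_split)
```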
- Large tree:
Define a workflow for a model to predict the outcome variable Price using the following predictors: Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Use the following settings depending on the model engine (rpart or partykit) you choose:
- rpart engine: keep the minimum number of records in a terminal node at 2 (min_n = 2), the maximum number of tree levels at 30 (tree_depth = 30), and the cost complexity at 0.0005 (cost_complexity = 0.0005) to make the run least restrictive, resulting in a large tree.
- partykit engine: keep the minimum number of records in a terminal node at 2 (min_n = 2) and the maximum number of tree levels at 30 (tree_depth = 30) to make the run least restrictive, resulting in a large tree.
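A sketch of the rpart variant of this workflow; `toyota_train` is a hypothetical name for the training set produced by the split step:

```r
library(tidymodels)

# Least-restrictive settings per the assignment, to grow a large tree.
large_tree_spec <- decision_tree(min_n = 2, tree_depth = 30,
                                 cost_complexity = 0.0005) |>
  set_engine("rpart") |>
  set_mode("regression")

large_tree_wf <- workflow() |>
  add_formula(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
                Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
                Automatic_airco + CD_Player + Powered_Windows + Sport_Model +
                Tow_Bar) |>
  add_model(large_tree_spec)

large_tree_fit <- fit(large_tree_wf, data = toyota_train)
```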
(1.3) Fit a model using the full training dataset. Inspect the resulting tree. Which appear to be the three most important car specifications for predicting the car’s price? (1 point - coding/discussion)
(1.4) Determine the prediction errors of the training and test sets by computing their RMSE (root mean squared error). How does the predictive performance of the test set compare to the training set? Why does this occur? (1 point - coding/discussion)
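One way to compute the two RMSE values with yardstick, assuming a fitted workflow and train/test sets under the hypothetical names `large_tree_fit`, `toyota_train`, and `toyota_test`:

```r
library(tidymodels)

# augment() appends a .pred column; rmse() compares it to the truth (Price).
bind_rows(
  augment(large_tree_fit, toyota_train) |>
    rmse(truth = Price, estimate = .pred) |> mutate(set = "train"),
  augment(large_tree_fit, toyota_test) |>
    rmse(truth = Price, estimate = .pred) |> mutate(set = "test")
)
```

Expect the training RMSE to be much lower than the test RMSE for such a deep tree, which is the overfitting the question asks you to explain.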
(1.5) How might we achieve better test predictive performance at the expense of training performance? (1 point - discussion)
- Smaller tree:
(1.6) Create a smaller tree. Compared to the deeper tree, what is the predictive performance on the test set? Comment on whether the smaller tree underfits. (3 points - coding/discussion)
- rpart engine: min_n = 2, tree_depth = 3, cost_complexity = 0.001
- partykit engine: min_n = 2, tree_depth = 3
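For the rpart variant only the model spec changes; a sketch, reusing a hypothetical formula workflow `large_tree_wf` and training set `toyota_train` from earlier steps:

```r
library(tidymodels)

# Shallow tree with the assignment's rpart settings.
small_tree_spec <- decision_tree(min_n = 2, tree_depth = 3,
                                 cost_complexity = 0.001) |>
  set_engine("rpart") |>
  set_mode("regression")

# Swap the new spec into the existing workflow and refit.
small_tree_wf  <- large_tree_wf |> update_model(small_tree_spec)
small_tree_fit <- fit(small_tree_wf, data = toyota_train)
```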
- Tuned tree:
(1.7) Now define a workflow that tunes the decision tree. Define a suitable range for the tuning parameter and use a tuning strategy of your choice. Make sure that the resulting best parameter is within the given range. (2 points - coding)
- rpart engine: tune cost_complexity
- partykit engine: tune tree_depth
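A sketch of the rpart variant using cross-validation with `tune_grid()`; the workflow name `large_tree_wf`, the training set `toyota_train`, the grid range, and the number of folds are all assumptions to adapt:

```r
library(tidymodels)

# Mark cost_complexity for tuning; keep the other settings unrestrictive.
tune_spec <- decision_tree(min_n = 2, tree_depth = 30,
                           cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

tune_wf <- large_tree_wf |> update_model(tune_spec)

# 10-fold CV on the training data; dials' cost_complexity() range is on
# the log10 scale, so c(-5, -1) spans 1e-5 to 1e-1.
set.seed(1)
folds   <- vfold_cv(toyota_train, v = 10)
cc_grid <- grid_regular(cost_complexity(range = c(-5, -1)), levels = 20)

tuned <- tune_grid(tune_wf, resamples = folds, grid = cc_grid,
                   metrics = metric_set(rmse))
show_best(tuned, metric = "rmse")
```

If the best value sits at an edge of the grid, widen the range and re-tune so the optimum lies inside it, as the question requires.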
(1.8) What is the best value of your tuning parameter? What is the predictive performance of the resulting model on the test set? (1 point - discussion)
(1.9) How does the predictive performance of the tuned model compare to the models from (1.3) and (1.6)? What do you observe? (1 point - discussion)
(1.10) Train a final model for the optimal tuning parameters and visualize the resulting tree. (1 point - coding/discussion)
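A sketch of finalizing and plotting the rpart variant, assuming tuning results in a hypothetical `tuned` object and a tunable workflow `tune_wf`; `rpart.plot` is one common choice for the visualization:

```r
library(tidymodels)
library(rpart.plot)

# Plug the best cost_complexity back into the workflow and refit.
best_cc   <- select_best(tuned, metric = "rmse")
final_wf  <- finalize_workflow(tune_wf, best_cc)
final_fit <- fit(final_wf, data = toyota_train)

# Pull out the underlying rpart object and plot the tree.
final_fit |> extract_fit_engine() |> rpart.plot(roundint = FALSE)
```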
- Predicting the price of a car:
(1.11) Given the various models, what is the predicted price for a car with the following characteristics (make sure to handle the categorical variables correctly):
Age_08_04 = 77, KM = 117000, Fuel_Type = Petrol, HP = 110, Automatic = No, Doors = 5, Quarterly_Tax = 100, Mfr_Guarantee = No, Guarantee_Period = 3, Airco = Yes, Automatic_airco = No, CD_Player = No, Powered_Windows = No, Sport_Model = No, Tow_Bar = Yes
Report all three predictions (large tree, small tree, tuned tree) and discuss differences. (1 point - coding/discussion)
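A sketch of the prediction step for one of the models (hypothetical fitted object `large_tree_fit`); note the comment on factor coding, which is the main way to get this wrong:

```r
library(tidymodels)

# New observation. The "Yes"/"No" strings are an assumption: the levels must
# match however the training data encodes these columns (e.g., 0/1). Check
# levels(toyota_train$Airco) and use the same coding here.
new_car <- tibble(
  Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110,
  Automatic = "No", Doors = 5, Quarterly_Tax = 100, Mfr_Guarantee = "No",
  Guarantee_Period = 3, Airco = "Yes", Automatic_airco = "No",
  CD_Player = "No", Powered_Windows = "No", Sport_Model = "No",
  Tow_Bar = "Yes"
)

predict(large_tree_fit, new_data = new_car)
```

Repeat the `predict()` call with the small-tree and tuned-tree fits to collect all three predictions for the discussion.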