Module 7

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-7.Rmd) and use it to answer the following questions.

If not otherwise stated, use Tidyverse and Tidymodels for the assignments.

1. Predicting Prices of Used Cars (Regression Trees)

The dataset contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 variables, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.gzc

Load the data from https://gedeck.github.io/DS-6030/datasets/homework/ToyotaCorolla.csv.gz.

  1. Load and preprocess the data

(1.1) Load and preprocess the data. Convert all relevant variables to factors. (2 points - coding)

(1.2) Split the data into training (60%), and test (40%) datasets. (1 point - coding)

  1. Large tree:

Define a workflow for a model to predict the outcome variable Price using the following predictors: Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Use the following settings depending on the used model engine:

  • rpart engine: Keep the minimum number of records in a terminal node to 2 (min_n = 2), maximum number of tree levels to 30 (tree_depth), and \(cost\_complexity = 0.0005\) (cost_complexity) to make the run least restrictive resulting in a large tree.

  • partykit engine: Keep the minimum number of records in a terminal node to 2 (min_n = 2) and maximum number of tree levels to 30 (tree_depth) to make the run least restrictive resulting in a large tree.

(1.3) Fit a model using the full training dataset. Inspect the resulting tree. Which appear to be the three or four most important car specifications for predicting the car’s price? (1 point - coding/discussion)

(1.4) Determine the prediction errors of the training and test sets by examining their RMS error. How does the predictive performance of the test set compare to the training set? Why does this occur? (1 point - coding/discussion)

(1.5) How might we achieve better test predictive performance at the expense of training performance? (1 point - discussion)

  1. Smaller tree:

(1.6) Create a smaller tree. Compared to the deeper tree, what is the predictive performance on the test set? (3 points - coding/discussion)

  • rpart engine: min_n=2, tree_depth=3, cost_complexity=0.001

  • partykit engine: min_n=2, tree_depth=3

  1. Tuned tree:

(1.7) Now define a workflow that tunes the decision tree. Define a suitable range for the tuning parameter and use a tuning strategy of your choice. Make sure that the resulting best parameter is within the given range. (2 points - coding)

  • rpart engine: tune cost_complexity

  • partykit engine: tune tree_depth

(1.8) What is the best value for of your tuning parameter? What is the predictive performance of the resulting model on the test set? (1 point - discussion)

(1.9) How does the predictive performance of the tuned model compare to the models from (1.3) and (1.6)? What do you observe? (1 point - discussion)

(1.10) Train a final model for the optimal tuning parameters and visualize the resulting tree. (1 point - coding/discussion)

  1. Predicting the price of a car:

(1.11) Given the various models, what is the predicted price for a car with the following characteristics (make sure to handle the categorical variables correctly): (1 point - coding/discussion)

Age_08_04=77, KM=117000, Fuel_Type=Petrol, HP=110, Automatic=No, Doors=5, Quarterly_Tax=100, Mfr_Guarantee=No, Guarantee_Period=3, Airco=Yes, Automatic_airco=No, CD_Player=No, Powered_Windows=No, Sport_Model=No, Tow_Bar=Yes