DS-6030 Homework Module 2
In this assignment you will build ordinary linear regression models.
Use Tidyverse and Tidymodels packages for the assignments.
You can download the Quarto Markdown file and use it as a basis for your solution.
You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:
- Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
- Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
- Verify the output: LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.
The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.
1. Flexible vs Inflexible Methods (3 points)
For each of parts (1.1) through (1.4), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(1.1) The sample size \(n\) is extremely large, and the number of predictors \(p\) is small. (0.5 points - discussion)
(1.2) The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small. (0.5 points - discussion)
(1.3) The relationship between the predictors and response is highly non-linear. (0.5 points - discussion)
(1.4) The variance of the error terms, i.e. \(\sigma^2 = Var(\epsilon)\), is extremely high. (0.5 points - discussion)
(1.5) Pick any one of (1.1)–(1.4). Ask an LLM to argue the opposite recommendation from the one you gave. Is the counter-argument valid? Under what data conditions might the opposite choice actually be the right one? (1 point - discussion)
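As a reference point for the discussion in (1.1) through (1.5), the textbook bias-variance decomposition of the expected test error may be helpful. It assumes the usual setup \(y = f(x) + \epsilon\) with \(E[\epsilon] = 0\); the symbols \(\hat{f}\) (the fitted model) and \(x_0\) (a new test point) are notation introduced here and do not appear elsewhere in the assignment.

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = Var\left(\hat{f}(x_0)\right) + \left[Bias\left(\hat{f}(x_0)\right)\right]^2 + Var(\epsilon)
\]

More flexible methods typically lower the bias term and raise the variance term. The last term, \(\sigma^2 = Var(\epsilon)\), is irreducible regardless of the method chosen.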
2. Predicting Airfare on New Routes (16 points)
The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players. The firm needs predictive models to support its consulting service. One thing the firm might want to be able to predict is fares, in the event a new airport is brought into service. The firm starts with the dataset Airfares.csv.gz, which contains real data that were collected between Q3-1996 and Q2-1997. The variables in these data are listed in the following Table, and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question that will be of interest in the analysis is the effect that the presence or absence of Southwest has on FARE.
| Variable | Description |
|---|---|
| S_CODE | Starting airport’s code |
| S_CITY | Starting city |
| E_CODE | Ending airport’s code |
| E_CITY | Ending city |
| COUPON | Average number of coupons (a one-coupon flight is a nonstop flight, a two-coupon flight is a one-stop flight, etc.) for that route |
| NEW | Number of new carriers entering that route between Q3-96 and Q2-97 |
| VACATION | Whether (Yes) or not (No) a vacation route |
| SW | Whether (Yes) or not (No) Southwest Airlines serves that route |
| HI | Herfindahl index: measure of market concentration |
| S_INCOME | Starting city’s average personal income |
| E_INCOME | Ending city’s average personal income |
| S_POP | Starting city’s population |
| E_POP | Ending city’s population |
| SLOT | Whether or not either endpoint airport is slot-controlled (this is a measure of airport congestion) |
| GATE | Whether or not either endpoint airport has gate constraints (this is another measure of airport congestion) |
| DISTANCE | Distance between two endpoint airports in miles |
| PAX | Number of passengers on that route during period of data collection |
| FARE | Average fare on that route |
- Data Exploration:
(2.1) Load the Airfares.csv.gz dataset and preprocess the data; drop unnecessary columns and convert categorical variables to factors. Check for missing values and handle them appropriately. (1 point - coding)
(2.2) Explore the numerical (continuous) predictors and response (FARE) by creating a correlation table and examining some scatterplots between FARE and those predictors. What seems to be the best single predictor of FARE? (2 points - coding/discussion)
(2.3) Explore the categorical predictors (excluding the first four) by creating individual graphs comparing the distribution of average fare for each category (e.g. box plots). Which categorical predictor seems best for predicting FARE? (2 points - coding/discussion)
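The exploration steps in (2.1) through (2.3) could be sketched roughly as follows. This is only an illustration on R's built-in `mtcars` data, not on the Airfares data: `mpg` stands in for the response, and all object names (`cars`, `missing_counts`, `cor_with_response`, and so on) are hypothetical. Adapt the column names and the factor conversion to the actual dataset.

```r
library(tidyverse)

# Illustration on the built-in mtcars data (NOT the Airfares data).
cars <- as_tibble(mtcars) %>%
  select(-carb) %>%                                   # drop a column we do not need
  mutate(across(c(vs, am), ~ factor(.x, labels = c("No", "Yes"))))  # 0/1 codes -> factors

# Count missing values in every column
missing_counts <- cars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Correlation of each numeric column with the response (mpg here),
# sorted by absolute strength; the first row is the response itself
cor_with_response <- cars %>%
  select(where(is.numeric)) %>%
  cor() %>%
  as_tibble(rownames = "variable") %>%
  select(variable, mpg) %>%
  arrange(desc(abs(mpg)))

# Scatterplot of the response against one numeric predictor
p_scatter <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point()

# Box plot of the response for each level of one categorical predictor
p_box <- ggplot(cars, aes(x = am, y = mpg)) +
  geom_boxplot()
```

Printing `p_scatter` or `p_box` renders the plots. Sorting the correlation table by absolute value is one convenient way to rank candidate predictors, but the scatterplots should still be checked for non-linear patterns.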
- Find a model for predicting the average fare on a new route:
(2.4) Partition the data into training and holdout sets. The model will be fit to the training data and evaluated on the holdout set. (see Machine learning with tidymodels: Creating an initial split of the data into training and holdout set) (1 point - coding)
(2.5) Train a linear regression model with tidymodels using all predictors. You can ignore the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY).
Examine the model coefficients and interpret them. Which predictors are significant? (see Machine learning with tidymodels: Linear regression models)
Determine the model performance using \(r^2\), RMSE, and MAE on the training and test sets. How does the model perform on the test set? Is the model overfitting? How can you tell? (see Machine learning with tidymodels: Measuring performance of regression models) (2 points - coding/discussion)
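For reference, the three metrics are commonly defined as below, where \(y_i\) is the observed fare, \(\hat{y}_i\) the predicted fare, and \(n\) the number of routes in the set being evaluated (this notation is introduced here). The \(r^2\) shown is the squared correlation between observed and predicted values, which is what `rsq()` in the yardstick package reports; check the course material if a different definition is intended.

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
r^2 = \left[\mathrm{cor}\left(y, \hat{y}\right)\right]^2
\]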
(2.6) After fitting the model from (2.5), paste the data dictionary and the coefficient table into an LLM and ask it to interpret three coefficients of your choice. Evaluate each interpretation: is it a correct correlational claim, or does the LLM slide into causal language (“X causes Y”, “increasing X will raise Y”)? Rewrite any problematic interpretation to be correctly correlational. (1 point - discussion)
(2.7) Taking the results from (2.2), (2.3), and (2.5) into account, build a model that includes only the most important predictors you identified. Determine the model performance and compare with the full model from (2.5). (2 points - coding/discussion)
(2.8) Using the models from (2.5) and (2.7), predict the average fare on a route with the following characteristics: COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles. Hint: make sure that you treat the categorical variables in the same way as in the training data. (1 point - coding)
(2.9) Using the smaller model from (2.7), predict the reduction in average fare on the route in (2.8) if Southwest decides to cover this route. (1 point - coding/discussion)
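The workflow in (2.4) through (2.9) follows a common tidymodels pattern: split, fit, measure, predict. The sketch below shows that pattern on R's built-in `mtcars` data, not on the Airfares data. Here `mpg` plays the role of the response and `am` the role of a Yes/No predictor; the seed, the split proportion, the formula, and all object names are arbitrary assumptions chosen for illustration.

```r
library(tidymodels)

# Illustration on the built-in mtcars data (NOT the Airfares data).
cars <- as_tibble(mtcars) %>%
  mutate(am = factor(am, labels = c("No", "Yes")))

set.seed(123)  # arbitrary seed so the split is reproducible
cars_split <- initial_split(cars, prop = 0.75)
cars_train <- training(cars_split)
cars_holdout <- testing(cars_split)

# Linear regression with the lm engine, fitted on the training set only
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp + am, data = cars_train)

# Coefficient table with estimates, standard errors, and p-values
coef_table <- tidy(lm_fit)

# r^2, RMSE, and MAE on the training and holdout sets
reg_metrics <- metric_set(rsq, rmse, mae)
train_perf <- augment(lm_fit, new_data = cars_train) %>%
  reg_metrics(truth = mpg, estimate = .pred)
holdout_perf <- augment(lm_fit, new_data = cars_holdout) %>%
  reg_metrics(truth = mpg, estimate = .pred)

# Prediction for a new observation; the factor must use the training levels
new_car <- tibble(
  wt = 3.0,
  hp = 120,
  am = factor("No", levels = levels(cars$am))
)
new_pred <- predict(lm_fit, new_data = new_car)

# Change in the prediction when only the categorical predictor is switched
new_car_yes <- new_car %>%
  mutate(am = factor("Yes", levels = levels(cars$am)))
pred_change <- predict(lm_fit, new_data = new_car_yes)$.pred - new_pred$.pred
```

Comparing `train_perf` with `holdout_perf` is one way to judge overfitting: holdout metrics much worse than training metrics suggest the model does not generalize. In a purely additive linear model, `pred_change` equals the coefficient of the switched factor level, which is a useful sanity check on the prediction code.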
- Predictors:
(2.10) In reality, which of the factors will not be available for predicting the average fare from a new airport (i.e., before flights start operating on those routes)? Which ones can be estimated? How? (1 point - discussion)
(2.11) Train a model that includes only factors that are available before flights begin to operate on the new route. (1 point - coding)
(2.12) Compare the predictive accuracy of this model with models from (2.5) and (2.7). Is this model good enough, or is it worthwhile reevaluating the model once flights begin on the new route? (1 point - discussion)