DS-6030 Homework Module 2

Author

Note

In this assignment you will build ordinary linear regression models.

Use Tidyverse and Tidymodels packages for the assignments.

You can download the Quarto Markdown file and use it as a basis for your solution.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Flexible vs Inflexible Methods (3 points)

For each of parts (1.1) through (1.4), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(1.1) The sample size \(n\) is extremely large, and the number of predictors \(p\) is small. (0.5 points - discussion)

(1.2) The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small. (0.5 points - discussion)

(1.3) The relationship between the predictors and response is highly non-linear. (0.5 points - discussion)

(1.4) The variance of the error terms, i.e. \(\sigma^2 = \mathrm{Var}(\epsilon)\), is extremely high. (0.5 points - discussion)
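
As background for these four parts, the standard decomposition of the expected test error at a point \(x_0\) can be written as below. The symbols \(\hat{f}\), \(x_0\), and \(y_0\) are notation introduced here for illustration and do not appear elsewhere in the assignment; only \(\mathrm{Var}(\epsilon) = \sigma^2\) is taken from (1.4).

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
\]

More flexible methods generally trade lower bias for higher variance, while the last term is irreducible regardless of the method chosen.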

Note: LLM assignment

(1.5) Pick any one of (1.1) through (1.4). Ask an LLM to argue the opposite recommendation from the one you gave. Is the counter-argument valid? Under what data conditions might the opposite choice actually be the right one? (1 point - discussion)

2. Predicting Airfare on New Routes (16 points)

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports.

There are numerous players and interests involved in the issue (airlines, city, state and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players. The firm needs predictive models to support its consulting service. One thing the firm might want to be able to predict is fares, in the event a new airport is brought into service.

The firm starts with the dataset Airfares.csv.gz, which contains real data that were collected between Q3-1996 and Q2-1997. The variables in these data are listed in the following table, and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question that will be of interest in the analysis is the effect that the presence or absence of Southwest has on FARE.

| Variable | Description |
|----------|-------------|
| S_CODE | Starting airport’s code |
| S_CITY | Starting city |
| E_CODE | Ending airport’s code |
| E_CITY | Ending city |
| COUPON | Average number of coupons (a one-coupon flight is a nonstop flight, a two-coupon flight is a one-stop flight, etc.) for that route |
| NEW | Number of new carriers entering that route between Q3-96 and Q2-97 |
| VACATION | Whether (Yes) or not (No) a vacation route |
| SW | Whether (Yes) or not (No) Southwest Airlines serves that route |
| HI | Herfindahl index: measure of market concentration |
| S_INCOME | Starting city’s average personal income |
| E_INCOME | Ending city’s average personal income |
| S_POP | Starting city’s population |
| E_POP | Ending city’s population |
| SLOT | Whether or not either endpoint airport is slot-controlled (this is a measure of airport congestion) |
| GATE | Whether or not either endpoint airport has gate constraints (this is another measure of airport congestion) |
| DISTANCE | Distance between two endpoint airports in miles |
| PAX | Number of passengers on that route during period of data collection |
| FARE | Average fare on that route |

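
For reference, the Herfindahl index listed as HI is conventionally defined as the sum of squared market shares. The symbols \(s_i\) (market share of carrier \(i\) on the route) and \(N\) (number of carriers on the route) are introduced here for illustration. The percent scale is an assumption, suggested by values such as HI = 4442.141 in (2.8).

\[
\mathrm{HI} = \sum_{i=1}^{N} s_i^2, \qquad s_i \text{ in percent}
\]

Under this convention a route served by a single carrier has \(\mathrm{HI} = 100^2 = 10{,}000\), and two carriers with equal shares give \(\mathrm{HI} = 50^2 + 50^2 = 5{,}000\). Higher values indicate a more concentrated market.
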
  1. Data Exploration:

(2.1) Load the Airfares.csv.gz dataset and preprocess the data; drop unnecessary columns and convert categorical variables to factors. Check for missing values and handle them appropriately. (1 point - coding)

(2.2) Explore the numerical (continuous) predictors and response (FARE) by creating a correlation table and examining some scatterplots between FARE and those predictors. What seems to be the best single predictor of FARE? (2 points - coding/discussion)

(2.3) Explore the categorical predictors (excluding the first four) by creating individual graphs comparing the distribution of average fare for each category (e.g. box plots). Which categorical predictor seems best for predicting FARE? (2 points - coding/discussion)

  2. Find a model for predicting the average fare on a new route:

(2.4) Partition the data into training and holdout sets. The model will be fit to the training data and evaluated on the holdout set. See “Machine learning with tidymodels: Creating an initial split of the data into training and holdout set”. (1 point - coding)
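
As a minimal sketch of such a partition (not the assignment solution), the block below uses the built-in `mtcars` data as a stand-in for the airfare data. The 80/20 proportion, the seed value, and the object names are assumptions chosen for illustration only.

```r
library(tidymodels)

# Fix the random seed so the partition is reproducible
set.seed(123)

# Assumption: 80% training / 20% holdout; mtcars stands in for the real data
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_holdout <- testing(car_split)

# Every row ends up in exactly one of the two sets
nrow(car_train) + nrow(car_holdout) == nrow(mtcars)
```

Setting the seed before calling `initial_split()` keeps the split, and therefore all downstream metrics, reproducible between runs.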

(2.5) Train a linear regression model with tidymodels using all predictors. You can ignore the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY). (2 points)

Note: LLM assignment

(2.6) After fitting the model from (2.5), paste the data dictionary and the coefficient table into an LLM and ask it to interpret three coefficients of your choice. Evaluate each interpretation: is it a correct correlational claim, or does the LLM slide into causal language (“X causes Y”, “increasing X will raise Y”)? Rewrite any problematic interpretation to be correctly correlational. (1 point - discussion)

(2.7) Taking the results from (2.2), (2.3), and (2.5) into account, build a model that includes only the most important predictors you identified. Determine the model performance and compare with the full model from (2.5). (2 points - coding/discussion)
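
One possible way to compare a full and a reduced model on a holdout set is sketched below, again on the built-in `mtcars` data rather than the airfare data. The choice of `wt` and `hp` as the "important" predictors, the helper name `holdout_metrics`, and the split settings are all hypothetical and only serve the illustration.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_holdout <- testing(car_split)

lm_spec <- linear_reg() %>% set_engine("lm")

# Full model uses all predictors; reduced model uses a hand-picked subset
full_fit <- lm_spec %>% fit(mpg ~ ., data = car_train)
small_fit <- lm_spec %>% fit(mpg ~ wt + hp, data = car_train)

# Hypothetical helper: RMSE, R-squared and MAE of one model on the holdout set
holdout_metrics <- function(model, label) {
  predict(model, new_data = car_holdout) %>%
    bind_cols(car_holdout) %>%
    metrics(truth = mpg, estimate = .pred) %>%
    mutate(model = label)
}

comparison <- bind_rows(
  holdout_metrics(full_fit, "full"),
  holdout_metrics(small_fit, "reduced")
)
comparison
```

Evaluating both models on the same holdout rows keeps the comparison fair; metrics computed on the training data would tend to favor the larger model.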

(2.8) Using the models from (2.5) and (2.7), predict the average fare on a route with the following characteristics: COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles. Hint: make sure that you treat the categorical variables in the same way as in the training data. (1 point - coding)
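
The hint about categorical variables can be illustrated with a small sketch on the built-in `mtcars` data (a stand-in, not the airfare data). The object names and the predictor values are assumptions; the point is that the new observation reuses the factor levels of the training data.

```r
library(tidymodels)

# Stand-in data: recode the 0/1 transmission flag as a labelled factor
car_data <- as_tibble(mtcars) %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))

car_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + am, data = car_data)

# New observation: reuse the training levels so the factor is encoded identically
new_car <- tibble(
  wt = 3.0,
  am = factor("manual", levels = levels(car_data$am))
)

new_pred <- predict(car_fit, new_data = new_car)
new_pred
```

Building the factor with `levels = levels(car_data$am)` avoids a mismatch between the categories seen during training and those in the new data.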

(2.9) Using the smaller model from (2.7), predict the reduction in average fare on the route in (2.8) if Southwest decides to cover this route. (1 point - coding/discussion)

  3. Predictors:

(2.10) In reality, which of the factors will not be available for predicting the average fare from a new airport (i.e., before flights start operating on those routes)? Which ones can be estimated? How? (1 point - discussion)

(2.11) Train a model that includes only factors that are available before flights begin to operate on the new route. (1 point - coding)

(2.12) Compare the predictive accuracy of this model with models from (2.5) and (2.7). Is this model good enough, or is it worthwhile reevaluating the model once flights begin on the new route? (1 point - discussion)