Module 2

In this assignment you will build ordinary linear regression models.

Use the Tidyverse and Tidymodels packages for this assignment.

You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-2.Rmd) and use it as a basis for your solution.

1. Flexible vs Inflexible Methods (2 points)

For each of parts (1.1) through (1.4), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method. Justify your answer.

(1.1) The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.

(1.2) The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.

(1.3) The relationship between the predictors and response is highly non-linear.

(1.4) The variance of the error terms, i.e. \(\sigma^2 = Var(\epsilon)\), is extremely high.
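As a reminder for reasoning about these four cases, the expected squared prediction error at a point can be split into variance, squared bias, and irreducible error. The symbols \(x_0\), \(y_0\), and \(\hat{f}\) (a test point, its response, and the fitted model) are notation introduced here for illustration and do not appear elsewhere in the assignment.

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \operatorname{Var}\left(\hat{f}(x_0)\right) + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \operatorname{Var}(\epsilon)
\]

More flexible methods tend to lower the bias term and raise the variance term, while \(\operatorname{Var}(\epsilon) = \sigma^2\) is unaffected by the choice of method.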

2. Predicting Airfare on New Routes

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players.

The firm needs predictive models to support its consulting service. One thing the firm might want to be able to predict is fares, in the event a new airport is brought into service. The firm starts with the dataset Airfares.csv.gz, which contains real data that were collected between Q3-1996 and Q2-1997. The variables in these data are listed in the following table and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question that will be of interest in the analysis is the effect that the presence or absence of Southwest has on FARE.

| Variable | Description |
|---|---|
| S_CODE | Starting airport’s code |
| S_CITY | Starting city |
| E_CODE | Ending airport’s code |
| E_CITY | Ending city |
| COUPON | Average number of coupons (a one-coupon flight is a nonstop flight, a two-coupon flight is a one-stop flight, etc.) for that route |
| NEW | Number of new carriers entering that route between Q3-96 and Q2-97 |
| VACATION | Whether (Yes) or not (No) a vacation route |
| SW | Whether (Yes) or not (No) Southwest Airlines serves that route |
| HI | Herfindahl index: measure of market concentration |
| S_INCOME | Starting city’s average personal income |
| E_INCOME | Ending city’s average personal income |
| S_POP | Starting city’s population |
| E_POP | Ending city’s population |
| SLOT | Whether or not either endpoint airport is slot-controlled (this is a measure of airport congestion) |
| GATE | Whether or not either endpoint airport has gate constraints (this is another measure of airport congestion) |
| DISTANCE | Distance between two endpoint airports in miles |
| PAX | Number of passengers on that route during period of data collection |
| FARE | Average fare on that route |

  a. Data Exploration:

(2.1) Load the data from https://gedeck.github.io/DS-6030/datasets/homework/Airfares.csv.gz and preprocess the data; convert categorical variables to factors. (1 point - coding)
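One way to approach the factor conversion is sketched below on a few made-up rows. The helper name `to_factors`, the choice of VACATION, SW, SLOT, and GATE as the categorical columns, and the level names other than Yes, No, and Free are assumptions for illustration, not part of the assignment.

```r
library(tidyverse)

# Columns treated as categorical here (an assumption based on the variable table)
categorical_cols <- c("VACATION", "SW", "SLOT", "GATE")

# Hypothetical helper: convert the listed columns to factors
to_factors <- function(df, cols = categorical_cols) {
  df %>% mutate(across(all_of(cols), as.factor))
}

# Made-up rows standing in for the real file
toy <- tibble(
  VACATION = c("No", "Yes", "No"),
  SW       = c("Yes", "No", "No"),
  SLOT     = c("Free", "Controlled", "Free"),
  GATE     = c("Free", "Free", "Constrained"),
  FARE     = c(64.1, 174.5, 207.8)
)
toy_prepared <- to_factors(toy)
glimpse(toy_prepared)
```

The same helper could be applied to the result of `read_csv()` on the URL above, since `read_csv()` can read gzip-compressed files from a URL directly.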

(2.2) Explore the numerical (continuous) predictors and response (FARE) by creating a correlation table and examining some scatterplots between FARE and those predictors. What seems to be the best single predictor of FARE? (2 points - coding/discussion)
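A correlation table for the numeric columns might be built along the following lines. The data here are simulated and only borrow column names from the variable table, so the numbers say nothing about the real dataset.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, only for reproducibility
# Simulated stand-in for the numeric columns of the real data
toy <- tibble(
  DISTANCE = runif(50, 100, 2500),
  COUPON   = 1 + DISTANCE / 2500 + rnorm(50, sd = 0.1),
  FARE     = 50 + 0.08 * DISTANCE + rnorm(50, sd = 20)
)

# Pairwise Pearson correlations between all numeric columns
cor_table <- toy %>%
  select(where(is.numeric)) %>%
  cor()

# Absolute correlation of each predictor with FARE, strongest first
fare_cor <- sort(
  abs(cor_table[rownames(cor_table) != "FARE", "FARE"]),
  decreasing = TRUE
)
round(cor_table, 2)
```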

(2.3) Explore the categorical predictors (excluding the first four) by creating individual graphs comparing the distribution of average fare for each category (e.g. box plots). Which categorical predictor seems best for predicting FARE? (2 points - coding/discussion)

  b. Find a model for predicting the average fare on a new route:

(2.4) Partition the data into training and holdout sets. The model will be fit to the training data and evaluated on the holdout set. (see DS-6030: Creating an initial split of the data into training and holdout set) (1 point - coding)
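A minimal sketch of such a partition with `initial_split()` from rsample follows. The simulated data, the seed, and the 75/25 proportion are arbitrary choices for illustration; the assignment does not prescribe them.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed so the split is reproducible
# Simulated stand-in for the prepared airfare data
toy <- tibble(
  DISTANCE = runif(100, 100, 2500),
  FARE     = 50 + 0.08 * DISTANCE + rnorm(100, sd = 20)
)

# Hold out 25% of the rows; the model never sees them during fitting
toy_split   <- initial_split(toy, prop = 0.75)
toy_train   <- training(toy_split)
toy_holdout <- testing(toy_split)
```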

(2.5) Train a linear regression model with tidymodels using all predictors. You can ignore the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY). Examine the model coefficients and interpret them. Which predictors are significant? (see DS-6030: Linear regression models)
Determine the model performance using \(R^2\), RMSE, and MAE on the training and holdout sets. How does the model perform on the holdout set? Is the model overfitting? How can you tell? (see DS-6030: Measuring performance of regression models) (2 points - coding/discussion)
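The fit-and-evaluate workflow could look roughly like the sketch below, which uses parsnip's `linear_reg()` with the `"lm"` engine and a yardstick `metric_set()`. The data are simulated with only two predictors, so the coefficients and metrics are not those of the real dataset.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed
# Simulated stand-in for the prepared airfare data
toy <- tibble(
  DISTANCE = runif(200, 100, 2500),
  SW       = factor(sample(c("No", "Yes"), 200, replace = TRUE)),
  FARE     = 50 + 0.08 * DISTANCE - 40 * (SW == "Yes") + rnorm(200, sd = 20)
)
toy_split <- initial_split(toy, prop = 0.75)

# Ordinary least squares on all remaining columns
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(FARE ~ ., data = training(toy_split))

tidy(lm_fit)  # coefficient table with estimates, standard errors, p-values

# Same three metrics on both partitions
reg_metrics <- metric_set(rsq, rmse, mae)
train_perf <- augment(lm_fit, new_data = training(toy_split)) %>%
  reg_metrics(truth = FARE, estimate = .pred)
holdout_perf <- augment(lm_fit, new_data = testing(toy_split)) %>%
  reg_metrics(truth = FARE, estimate = .pred)

bind_rows(train = train_perf, holdout = holdout_perf, .id = "partition")
```

Comparing the two partitions side by side is what makes overfitting visible: a model that does much better on the training rows than on the held-out rows has likely adapted to noise.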

(2.6) Taking the results from (2.2), (2.3), and (2.5) into account, build a model that includes only the most important predictors. Determine the model performance and compare with the full model from (2.5). (2 points - coding/discussion)

(2.7) Using the models from (2.5) and (2.6), predict the average fare on a route with the following characteristics: COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782, DISTANCE = 1976 miles. Hint: make sure that you treat the categorical variables in the same way as in the training data. (1 point - coding)

(2.8) Using the smaller model from (2.6), predict the reduction in average fare on the route in (2.7) if Southwest decides to cover this route. (1 point - coding/discussion)
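The hint in (2.7) and the comparison in (2.8) can both be illustrated with a sketch like the one below: the new observation gets its factor built with the training data's levels, and the Southwest effect is the difference between two predictions that differ only in SW. The model and data are simulated stand-ins with just two predictors, so the resulting number is not an answer to the question.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed
# Simulated stand-in for the training data
toy <- tibble(
  DISTANCE = runif(200, 100, 2500),
  SW       = factor(sample(c("No", "Yes"), 200, replace = TRUE)),
  FARE     = 50 + 0.08 * DISTANCE - 40 * (SW == "Yes") + rnorm(200, sd = 20)
)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(FARE ~ DISTANCE + SW, data = toy)

# New route: reuse the training levels so the factor is encoded identically
new_route <- tibble(
  DISTANCE = 1976,
  SW       = factor("No", levels = levels(toy$SW))
)
# Same route, but with Southwest present
with_sw <- new_route %>%
  mutate(SW = factor("Yes", levels = levels(toy$SW)))

fare_without_sw <- predict(lm_fit, new_data = new_route)$.pred
fare_with_sw    <- predict(lm_fit, new_data = with_sw)$.pred
reduction       <- fare_without_sw - fare_with_sw
```

For a linear model without interaction terms, this difference equals the negated coefficient of the SW indicator, which gives a quick consistency check.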

  c. Predictors

(2.9) In reality, which of the factors will not be available for predicting the average fare from a new airport (i.e., before flights start operating on those routes)? Which ones can be estimated? How? (1 point - discussion)

(2.10) Train a model that includes only factors that are available before flights begin to operate on the new route. (1 point - coding)

(2.11) Compare the predictive accuracy of this model with models from (2.5) and (2.6). Is this model good enough, or is it worthwhile reevaluating the model once flights begin on the new route? (1 point - discussion)