DS-6030 Homework Module 1

Author

Note

Assignments (1) and (2) cover theoretical aspects of the course. Assignment (3) requires you to explore the ISLR2::Boston dataset using graphs.

Use Tidyverse packages for assignment (3).

You can download the Quarto Markdown file and use it as a basis for your solution.

Warning: LLM use

You may use LLMs (ChatGPT, Claude, Copilot, etc.) on this assignment. If you do, you must:

  1. Disclose which LLM you used and roughly what for (concept clarification, code generation, prose review).
  2. Include the prompts for any output you used substantially. Paste them into an “LLM use” appendix at the end of your submission.
  3. Verify the output — LLMs frequently make small but plausible-sounding errors (wrong variable names, made-up tidymodels functions, wrong claims about dataset properties). Check anything you keep against the course material or the actual data.

The grader will spot-check disclosure. Undisclosed LLM use is treated as a citation failure.

1. Classification vs Regression - Inference or Prediction? (2 points)

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide the number of data points, \(n\), and the number of predictors, \(p\).

For each scenario, provide:

  • classification or regression, justify your answer
  • inference or prediction, justify your answer
  • number of data points, \(n\)
  • number of predictors, \(p\)

Example: We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

  • Regression, since the response is continuous (CEO salary).
  • Inference, since we want to understand the relationship between predictors and CEO salary.
  • \(n = 500\) firms
  • \(p = 3\) predictors

(1.1) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

(1.2) A streaming service wants to understand which user behaviors are most strongly associated with cancellation of a subscription within the first 90 days. They examine the records of 12,000 recent sign-ups. For each user, they have recorded whether the user cancelled in the first 90 days, number of titles watched in the first 30 days, total minutes streamed in the first 30 days, payment plan, device type, and six other behavioral variables.

(1.3) A regional power utility needs to forecast tomorrow’s peak electricity demand (in megawatts) so it can schedule generation capacity. They have a five-year history of daily peak demand. For each day they have recorded the peak demand, average temperature, average humidity, day of the week, and four other weather and calendar variables.

(1.4) A bank wants to flag credit card transactions that are likely fraudulent at the moment they occur. They have a labeled history of 200,000 recent transactions, each annotated as fraudulent or legitimate. For each transaction they have recorded the amount, merchant category, time of day, distance from the cardholder’s billing address, and 16 other engineered features.

2. Describe the differences between a parametric and a non-parametric statistical learning approach (4 points)

Note: LLM assignment

What are the advantages of a parametric approach to regression or classification as opposed to a non-parametric approach? What are its disadvantages?

(2.1) Ask an LLM to list two advantages of a parametric approach. Paste the LLM’s answer. Then critique it: which items are well-supported, which are overstated or misleading, and what (if anything) is missing? Add at least one advantage the LLM did not produce. (2 points - discussion)

(2.2) This time, you write first, then have the LLM critique your answer. List two disadvantages of a parametric approach. Paste your answer into the LLM and ask it to critique your answer. What do you think of the LLM’s critique? Is it justified? What (if anything) is missing from the LLM’s critique? (2 points - discussion)

Hint: If the LLM’s critique is generic or just agrees with you, push back with a follow-up prompt asking for specific weaknesses, missing points, or factual errors. Include that follow-up in your submission.

3. Explore the dataset ISLR2::Boston (12 points)

The Boston dataset is often used as an example for regression problems. It contains the following variables:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.
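
Before plotting, it can help to load the data and confirm that the columns match the list above. The following is a minimal sketch, assuming the ISLR2 and tidyverse packages are installed; the object name `boston` is an arbitrary choice for illustration, not something the assignment prescribes.

```r
library(tidyverse)

# Access the data frame through the package namespace and convert it
# to a tibble for nicer printing. `boston` is an illustrative name.
boston <- as_tibble(ISLR2::Boston)

glimpse(boston)   # column names and types
summary(boston)   # per-variable ranges and quartiles
```
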

  1. Create histograms and/or density plots of each feature using ggplot2:

(3.1) Combine all per-feature plots into a single figure using patchwork rather than producing one figure per feature. (2 points - coding)
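
As a reminder of how patchwork composes plots, here is a small sketch of the mechanics only. It deliberately uses the built-in `mtcars` data rather than the assignment data, and the object names (`p1`, `p2`, `combined`, `combined_from_list`) are illustrative assumptions.

```r
library(ggplot2)
library(patchwork)

# Two independent ggplot objects (illustrative data: mtcars)
p1 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p2 <- ggplot(mtcars, aes(x = hp)) + geom_density()

# With patchwork loaded, `+` places plots next to each other
combined <- p1 + p2

# wrap_plots() accepts a list of plots, which is convenient when
# the plots are generated programmatically
combined_from_list <- wrap_plots(list(p1, p2), ncol = 2)

combined
```

The list form tends to scale better when there is one plot per feature, since the list can be built with a loop or a mapping function instead of being typed out by hand.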

(3.2) Look for interesting patterns in the distributions. For example, are distributions highly skewed? Do you notice any outliers? Document your findings. (2 points - discussion)

(3.3) Are there any variables that should be transformed? (2 points - discussion)

  2. With the processed dataset, create three or more plots using ggplot2 to explore the relationships between the variables. Document your findings:

(3.4) Create three or more scatterplots between selected predictors and medv. (2 points - coding)

(3.5) Do you see strong correlation between some of the variables? What type of correlation do you see? (2 points - discussion)

(3.6) What would be the consequence of the correlations you identified in (3.5) if you want to train a regression model? (1 point - discussion)

Note: LLM assignment

(3.7) Pick one of the graphs you produced in (3.1) or (3.4) and submit it to a multimodal LLM (e.g., ChatGPT, Claude, Gemini) with a request to interpret what the graph shows. Paste the LLM’s response. Then evaluate it: do you agree with the interpretation? Did the LLM point something out that surprised you? What (if anything) does the LLM get wrong (e.g., misread values, hallucinated labels, missing context about the variables)? Add anything important the LLM missed. (1 point - discussion)

Hint: Multimodal LLMs often misread specific values or invent labels. Cross-check the LLM’s claims against the graph and the underlying data.