Module 1
Assignments (1) and (2) cover theoretical aspects of the course. Assignment (3) requires you to explore the ISLR2::Boston dataset using graphs. Use Tidyverse packages for assignment (3). You can download the R Markdown file (https://gedeck.github.io/DS-6030/homework/Module-1.Rmd) and use it as a basis for your solution.
1. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide the number of data points, \(n\), and the number of predictors, \(p\). (3 points)
(1.1) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
(1.2) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
(1.3) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
2. Describe the differences between a parametric and a non-parametric statistical learning approach. (2 points)
What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
(2.1) Advantages
(2.2) Disadvantages
3. Explore the dataset ISLR2::Boston. (11 points)
The Boston dataset is often used as an example for regression problems. It contains the following variables:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
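As a starting point, the dataset can be loaded and inspected as in the minimal sketch below. It assumes the ISLR2 and dplyr packages are installed; the object name boston is only an illustrative choice.

```r
# Assumption: the ISLR2 and dplyr packages are installed.
library(dplyr)

# Convert the data frame to a tibble for nicer printing.
boston <- as_tibble(ISLR2::Boston)

# Column names, types, and the first few values of each variable.
glimpse(boston)

# Five-number summary plus mean for every column.
summary(boston)
```

The summary output gives a first hint about ranges and possible skew before any plots are drawn.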
- Create histograms and/or density plots of each feature using ggplot2.
(3.1) Don’t create individual graphs. Use patchwork to combine multiple graphs into a figure. (2 points - coding)
(3.2) Look for interesting patterns in the distributions. For example, are distributions highly skewed? Do you notice any outliers? Document your findings. (2 points - discussion)
(3.3) Are there any variables that should be transformed? (2 points - discussion)
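One possible way to build the combined figure is sketched below. It assumes the ISLR2, ggplot2, and patchwork packages are installed. The helper plot_hist and the choice of 30 bins and 4 columns are illustrative assumptions, not part of the assignment, and the log scale on crim is shown only as an example of how a transformation can be previewed.

```r
# Assumption: ISLR2, ggplot2, and patchwork are installed.
library(ggplot2)
library(patchwork)

boston <- ISLR2::Boston

# Hypothetical helper: draw one histogram for a single column given by name.
plot_hist <- function(data, column, bins = 30) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(bins = bins) +
    labs(x = column, y = "count")
}

# One histogram per variable, collected in a list.
plots <- lapply(names(boston), function(col) plot_hist(boston, col))

# patchwork arranges the list of plots into a single figure.
combined <- wrap_plots(plots, ncol = 4)
combined

# Previewing a log-scaled x axis for one strictly positive variable.
# Whether any variable actually needs transforming is for you to judge.
crim_log <- plot_hist(boston, "crim") + scale_x_log10()
crim_log
```

Replacing geom_histogram() with geom_density() in the helper gives density plots instead, and the rest of the sketch stays the same.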
- With the processed dataset, create three or more plots using ggplot2 to explore the relationships between the variables. Document your findings.
(3.4) Create three or more scatterplots. (2 points - coding)
(3.5) Do you see strong correlation between some of the variables? What type of correlation do you see? (2 points - discussion)
(3.6) What would be the consequences of these correlations if you wanted to train a regression model? (1 point - discussion)
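The scatterplots and a numeric check of the correlations might look like the sketch below. It assumes the ISLR2, ggplot2, and patchwork packages are installed and, for brevity, uses the untransformed data. The three variable pairs are arbitrary illustrations, so choose the pairs that your own exploration suggests.

```r
# Assumption: ISLR2, ggplot2, and patchwork are installed.
library(ggplot2)
library(patchwork)

boston <- ISLR2::Boston

# Three example scatterplots; the variable pairs are illustrative only.
p1 <- ggplot(boston, aes(x = rm, y = medv)) + geom_point(alpha = 0.5)
p2 <- ggplot(boston, aes(x = lstat, y = medv)) + geom_point(alpha = 0.5)
p3 <- ggplot(boston, aes(x = nox, y = dis)) + geom_point(alpha = 0.5)

# patchwork's + operator places the plots side by side in one figure.
scatter <- p1 + p2 + p3
scatter

# Pairwise Pearson correlations between all variables, rounded for reading.
cor_matrix <- cor(boston)
round(cor_matrix, 2)
```

Pearson correlation only measures linear association, so a curved pattern in a scatterplot can be strong even when the corresponding entry of cor_matrix is moderate.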