Chapter 20 Linear regression models
The DS-6030 course primarily focuses on predictive modeling. However, it is useful to know how to access and interpret the model parameters of linear regression models. In this chapter, we use the tidymodels framework to build a linear regression model and then look at the model parameters and diagnostics.
Load the required libraries:

library(tidyverse)
library(tidymodels)
library(kableExtra)
library(ggfortify) # required for diagnostics plots
20.1 Build a linear regression model
We will use the example from Chapter 8 here.
data <- datasets::mtcars %>%
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
model <- linear_reg(engine="lm", mode="regression") %>%
    fit(formula, data=data)
20.2 Analyze model parameters
The tidyverse framework introduced the function tidy, which is a generic function similar to plot or predict. This means there are implementations for objects of different types. The parsnip package also provides an implementation of the tidy function that works for fitted models. Using it with our linear regression model extracts the model parameters. Table 20.1 shows the model parameters extracted with the tidy function.
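Because tidy() returns a regular tibble, the coefficient table can be manipulated with the usual dplyr verbs. A minimal sketch: keep only the terms with a p-value below 0.1 (for this model, only wt passes; see Table 20.1).

model %>%
    tidy() %>%
    filter(p.value < 0.1)   # only wt passes this cut (see Table 20.1)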
Table 20.1: Linear regression model parameters extracted using the tidy function

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 12.303 | 18.718 | 0.657 | 0.518 |
cyl | -0.111 | 1.045 | -0.107 | 0.916 |
disp | 0.013 | 0.018 | 0.747 | 0.463 |
hp | -0.021 | 0.022 | -0.987 | 0.335 |
drat | 0.787 | 1.635 | 0.481 | 0.635 |
wt | -3.715 | 1.894 | -1.961 | 0.063 |
qsec | 0.821 | 0.731 | 1.123 | 0.274 |
vsstraight | 0.318 | 2.105 | 0.151 | 0.881 |
ammanual | 2.520 | 2.057 | 1.225 | 0.234 |
gear | 0.655 | 1.493 | 0.439 | 0.665 |
carb | -0.199 | 0.829 | -0.241 | 0.812 |
20.3 Extract model statistics
The broom package provides the function glance to extract model statistics from a fitted linear regression model. Table 20.2 shows the model statistics for our case.
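glance() likewise returns a one-row tibble, so individual statistics can be pulled out directly; a short sketch:

model %>%
    glance() %>%
    select(adj.r.squared, sigma, AIC)   # pick out individual fit statistics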
Table 20.2: Linear regression model statistics extracted using the glance function

r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.869 | 0.807 | 2.65 | 13.932 | 0 | 10 | -69.855 | 163.71 | 181.299 | 147.494 | 21 | 32 |
20.4 Diagnostic plots
The ggfortify package provides implementations of the generic autoplot function that create up to six diagnostic plots using ggplot2. The plots are selected with the which argument; the default creates four plots (1, 2, 3, and 5). A sketch showing how to request all six follows the table below.
which | Plot |
---|---|
1 | Residuals vs Fitted |
2 | Normal Q-Q |
3 | Scale-Location |
4 | Cook’s distance |
5 | Residuals vs Leverage |
6 | Cook’s distance vs Leverage |
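As a sketch, all six plots can be requested at once. This follows the chapter's own autoplot(model, ...) usage; the ncol argument comes from ggfortify's autoplot method for lm objects, and if autoplot does not dispatch on the parsnip fit in your setup, pass the underlying lm object via extract_fit_engine(model) instead.

autoplot(model, which=1:6, ncol=2)   # all six diagnostic plots in two columns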
20.4.1 Residuals vs Fitted
The argument which=1 creates a plot of residuals vs fitted values (see Figure 20.1). This type of plot is useful for all types of regression, not only linear regression models.
The blue line is a smoother of the residuals. Ideally it is flat and shows no trend. Looking at the plot, we can see some non-linearity in the residuals. This could be an indication to explore variable transformations or polynomial terms.
The plot can also reveal heteroscedasticity of the residuals, but only qualitatively; the scale-location plot shows this better.
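The same view can also be rebuilt by hand when customization is needed. A minimal sketch using broom's augment() on the underlying lm object (extract_fit_engine() is from parsnip):

extract_fit_engine(model) %>%     # the underlying lm object
    augment() %>%                 # adds .fitted and .resid columns
    ggplot(aes(.fitted, .resid)) +
    geom_point() +
    geom_smooth(se=FALSE) +       # the smoother drawn in blue by autoplot
    geom_hline(yintercept=0, linetype="dashed")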
20.4.2 Normal Q-Q plot
The normal Q-Q plot examines the distribution of the residuals. The normal Q-Q plot in Figure 20.1 shows that, in our case, the residuals are approximately normally distributed.
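For a numeric complement to the visual check, a Shapiro-Wilk test on the residuals is one option (a sketch; the test is sensitive to sample size and should not replace the plot):

shapiro.test(residuals(extract_fit_engine(model)))   # H0: residuals are normal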
20.4.3 Scale-location plot
The scale-location plot makes changes in the variance of the residuals easier to spot. Figure 20.2 shows that, in our case, the variation doesn't change much.
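A common numeric counterpart is the Breusch-Pagan test for heteroscedasticity. A sketch, assuming the lmtest package is installed (it is not used elsewhere in this chapter):

lmtest::bptest(extract_fit_engine(model))   # small p-value suggests non-constant variance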
20.4.4 Cook’s distance plot
Cook’s distance measures the influence of a data point on the regression. Figure 20.2 shows that two data points stand out (9 and 29).
Some suggest that points with a Cook’s distance greater than 1 warrant closer inspection; none of our data points are flagged by this rule. Others suggest \(4/(n-p-1)\) as a threshold. For our model, \(n=32\) and \(p=10\), so \(4/(n-p-1) = 4/(32-10-1) \approx 0.19\). This threshold flags observations 9 and 29 as points of interest.
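This threshold is easy to compute and apply directly; a minimal sketch on the underlying lm object:

lm_fit <- extract_fit_engine(model)   # the underlying lm object
d <- cooks.distance(lm_fit)
n <- length(d)                        # 32 observations
p <- length(coef(lm_fit)) - 1         # 10 predictors
which(d > 4 / (n - p - 1))            # flags observations 9 and 29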
20.4.5 Residuals vs Leverage
Leverage is another measure of how influential a data point can be: the larger the leverage, the more the data point can affect the regression. In our case (Figure 20.3), no point stands out.
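Hat values quantify leverage and can be inspected directly. A sketch using the common rule of thumb of flagging points above twice the average leverage \((p+1)/n\):

lm_fit <- extract_fit_engine(model)   # the underlying lm object
h <- hatvalues(lm_fit)                # leverage of each observation
which(h > 2 * mean(h))                # mean(h) equals (p+1)/n for a linear model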
20.4.6 Cook’s distance vs Leverage
The final plot shows Cook’s distance vs leverage (Figure 20.3). Again, look for points with a large Cook’s distance or large leverage.
Further information can be found in the following resources:
- https://bookdown.org/dereksonderegger/571/7-Diagnostics-Chapter.html
- Comprehensive slide deck by John Fox, the author of the book “Regression Diagnostics” https://socialsciences.mcmaster.ca/jfox/Courses/Brazil-2009/slides-handout.pdf
Code
The code of this chapter is summarized here.
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)
library(tidymodels)
library(kableExtra)
library(ggfortify) # required for diagnostics plots
data <- datasets::mtcars %>%
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
model <- linear_reg(engine="lm", mode="regression") %>%
    fit(formula, data=data)
model %>%
    tidy() %>%
    knitr::kable(digits=3, caption="Linear regression model parameters extracted using the tidy function")
model %>%
    glance() %>%
    knitr::kable(digits=3,
                 caption="Linear regression model statistics extracted using the glance function") %>%
    scroll_box(width = "100%")
tibble(
    which=c(1, 2, 3, 4, 5, 6),
    Plot=c("Residuals vs Fitted", "Normal Q-Q",
           "Scale-Location", "Cook's distance",
           "Residuals vs Leverage", "Cook's distance vs Leverage")
) %>%
    kableExtra::kbl(caption="Plots created by `which` argument") %>%
    kableExtra::kable_styling(full_width=FALSE)
autoplot(model, which=c(1, 2))
autoplot(model, which=c(3, 4))
autoplot(model, which=c(5, 6))