Chapter 20 Linear regression models

The DS-6030 course primarily focuses on predictive modeling. However, it is useful to know how to access the model parameters and interpret them for linear regression models. In this section, we will use the tidymodels framework to build a linear regression model and then look at the model parameters and diagnostics.

Load required libraries

Code
library(tidyverse)
library(tidymodels)
library(kableExtra)
library(ggfortify)  # required for diagnostics plots

20.1 Build a linear regression model

We will use the example from chapter 8 here.

Code
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
model <- linear_reg(engine="lm", mode="regression") %>% 
    fit(formula, data=data)

20.2 Analyze model parameters

The tidyverse framework introduced the function tidy which is a generic function similar to plot or predict. This means, there are implementations for objects of different types. The parsnip package also provides an implementations of the tidy function that, works for fitted models. Using it with our linear regression model, extracts the model parameters. Table 20.1 shows the model parameters extracted with the tidy function.

Code
model %>% 
    tidy() %>% 
    knitr::kable(digits=3, caption="Linear regression model parameters extracted using the tidy function")
Table 20.1: Linear regression model parameters extracted using the tidy function
term estimate std.error statistic p.value
(Intercept) 12.303 18.718 0.657 0.518
cyl -0.111 1.045 -0.107 0.916
disp 0.013 0.018 0.747 0.463
hp -0.021 0.022 -0.987 0.335
drat 0.787 1.635 0.481 0.635
wt -3.715 1.894 -1.961 0.063
qsec 0.821 0.731 1.123 0.274
vsstraight 0.318 2.105 0.151 0.881
ammanual 2.520 2.057 1.225 0.234
gear 0.655 1.493 0.439 0.665
carb -0.199 0.829 -0.241 0.812

20.3 Extract model statistics

The broom package provides the function glance to extract model statistics from a fitted linear regression model. Table 20.2 shows the model statistics for our case.

Code
model %>% 
    glance() %>% 
    knitr::kable(digits=3, 
        caption="Linear regression model statistics extracted using the glance function") %>% 
    scroll_box(width = "100%")
Table 20.2: Table 20.3: Linear regression model statistics extracted using the glance function
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.869 0.807 2.65 13.932 0 10 -69.855 163.71 181.299 147.494 21 32

20.4 Diagnostics plots

The ggfortify package provides implementations of the generic autoplot function that creates up to six diagnostics plots using ggplot. The plots are selected using the which argument. The default creates four plots (1, 2, 3, and 5).
Table 20.4: Plots created by which argument
which Plot
1 Residuals vs Fitted
2 Normal Q-Q
3 Scale-Location
4 Cook’s distance
5 Residuals vs Leverage
6 Cooks’s distance vs Leverage

20.4.1 Residuals vs Fitted

The argument which=1 creates a graph of residuals vs fitted (see Figure 20.1). This type of plot is useful for all types of regression, not only linear regressions models.

Code
autoplot(model, which=c(1, 2))
Diagnostics plots: residuals vs fitted (`which=1`) and Normal Q-Q plot (`which=2`)

Figure 20.1: Diagnostics plots: residuals vs fitted (which=1) and Normal Q-Q plot (which=2)

The blue line is a smoother of the residuals. The line should be a flat line and show no trend in the data. Looking at the plot, we can see that there is some non-linearity in the residuals. This could be an indication to explore variable transformations or polynomial terms.

The plot can also reveal heteroscedasticity of the residuals. It is however more qualitative; the scale-location plot can show this better.

20.4.2 Normal Q-Q plot

The normal Q-Q plot looks at the distribution of the residuals. The normal Q-Q plot in Figure 20.1 shows that in our case, residuals are normally distributed.

20.4.3 Scale-location plot

Diagnostics plots: scale location plot (`which=3`) and Cook's distance plot (`which=4`)

Figure 20.2: Diagnostics plots: scale location plot (which=3) and Cook’s distance plot (which=4)

The scale location plot can reveal changes in variance of the residuals easier. Figure 20.2, shows that in our case, the variation doesn’t change much.

20.4.4 Cook’s distance plot

The Cook’s distance measures the influence of a data point on the regression. Figure 20.2, shows that two data points stand out (9 and 29).

Some suggest that points with a Cook’s distance greater than 1 should be looked at. None of our data points are flagged in this case. Others suggest \(4/(n-p-1)\) as a threshold. For our model, \(n=32\) and \(p=10\), so \(4/(n-p-1) = 4/(32-10-1) = 0.19\). This will flag 9 and 29 as points of interest.

20.4.5 Residuals vs Leverage

Diagnostics plots: residuals vs leverage (`which=5`) and Cook's distance vs leverage (`which=6`)

Figure 20.3: Diagnostics plots: residuals vs leverage (which=5) and Cook’s distance vs leverage (which=6)

Leverage is another measure of how influential a data point is. The larger the leverage, the more the data point affects the regression. In our case (Figure 20.3), no point stands out.

20.4.6 Cooks’s distance vs Leverage

The final plot shows Cook’s distance vs leverage (Figure 20.3). Again, look for points with large Cook’s distance or large leverage.

Further information: Additional information can be found in the following resources:

Code

The code of this chapter is summarized here.

Code
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)
library(tidymodels)
library(kableExtra)
library(ggfortify)  # required for diagnostics plots
data <- datasets::mtcars %>% 
    as_tibble(rownames="car") %>%
    mutate(
        vs = factor(vs, labels=c("V-shaped", "straight")),
        am = factor(am, labels=c("automatic", "manual")),
    )
formula <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
model <- linear_reg(engine="lm", mode="regression") %>% 
    fit(formula, data=data)
model %>% 
    tidy() %>% 
    knitr::kable(digits=3, caption="Linear regression model parameters extracted using the tidy function")
model %>% 
    glance() %>% 
    knitr::kable(digits=3, 
        caption="Linear regression model statistics extracted using the glance function") %>% 
    scroll_box(width = "100%")
tibble(
    which=c(1, 2, 3, 4, 5, 6),
    Plot=c("Residuals vs Fitted", "Normal Q-Q", 
      "Scale-Location", "Cook's distance", 
      "Residuals vs Leverage", "Cooks's distance vs Leverage")
) %>% 
    kableExtra::kbl(caption="Plots created by `which` argument") %>%
    kableExtra::kable_styling(full_width=FALSE)
autoplot(model, which=c(1, 2))
autoplot(model, which=c(3, 4))
autoplot(model, which=c(5, 6))