Chapter 29 Defining models using formulae
29.1 Linear models
In R, statistical models are usually defined using a formula. Here is an example:
y ~ x1 + x2 + x3
This formula describes a linear model of the form:
\[
y = c_1 x1 + c_2 x2 + c_3 x3 + y_0
\]
The outcome \(y\) is a linear combination of the predictors \(x1\), \(x2\), and \(x3\) with coefficients \(c_1\), \(c_2\), and \(c_3\); \(y_0\) is a constant intercept. Note that the intercept does not appear in the formula: R adds it implicitly. If you want to make the intercept explicit, you can write the following formula, where `1` represents the constant intercept \(y_0\), which is still estimated by the model.
y ~ x1 + x2 + x3 + 1
To exclude the intercept and fit a linear model without intercept, use one of the following options:
y ~ x1 + x2 + x3 - 1
y ~ x1 + x2 + x3 + 0
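A minimal sketch of how these intercept options behave, assuming simulated data (the data frame `df` and the coefficients used to generate it are made up for illustration):

```r
# Simulated data; variable names and true coefficients are illustrative only.
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 2 * df$x1 - df$x2 + 0.5 * df$x3 + 3 + rnorm(50, sd = 0.1)

# Implicit and explicit intercept describe the same model.
fit_implicit <- lm(y ~ x1 + x2 + x3, data = df)
fit_explicit <- lm(y ~ x1 + x2 + x3 + 1, data = df)

# Both "- 1" and "+ 0" remove the intercept.
fit_minus1 <- lm(y ~ x1 + x2 + x3 - 1, data = df)
fit_plus0 <- lm(y ~ x1 + x2 + x3 + 0, data = df)

names(coef(fit_implicit))  # includes "(Intercept)"
names(coef(fit_minus1))    # only the three predictors
```

Comparing the coefficient names is a quick way to confirm whether a fitted model contains an intercept.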
If your variable names contain spaces, you need to surround each name in the formula with backtick characters (`).
quality ~ `fixed acidity` + `volatile acidity` + chlorides
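As a rough illustration, the formula above can be fitted to a small data frame. The `wine` data frame and its values below are hypothetical; `check.names = FALSE` is needed so that `data.frame()` keeps the spaces in the column names.

```r
# Hypothetical toy data; the values are made up for illustration.
wine <- data.frame(
  quality = c(5, 6, 5, 7, 6, 5),
  `fixed acidity` = c(7.4, 7.8, 11.2, 7.3, 7.9, 6.7),
  `volatile acidity` = c(0.70, 0.88, 0.28, 0.65, 0.60, 0.58),
  chlorides = c(0.076, 0.098, 0.075, 0.065, 0.069, 0.080),
  check.names = FALSE  # keep the spaces in the column names
)

# Backticks let the formula refer to the non-syntactic column names.
fit <- lm(quality ~ `fixed acidity` + `volatile acidity` + chlorides, data = wine)
names(coef(fit))
```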
Here, “fixed acidity” and “volatile acidity” are names of variables or columns in a data frame.
You will frequently come across formulas like `y ~ .`, as in the following linear regression model:
model <- lm(y ~ ., data=df)
Here, the `.` stands for “all columns not otherwise in the formula”. For example, if the tibble (or data frame) contains columns `a`, `b`, `c`, and `y`, then `y ~ .` is equivalent to the explicit formula `y ~ a + b + c`.
While it might be tempting to use the shortcut `y ~ .`, it is better to be explicit and list all terms to ensure reproducibility. Even if we know exactly what is included in a dataset when we develop the model, data can change over time. For example, if data are downloaded from an external source and columns are added, model results will change. Columns added to the dataset during exploratory data analysis are another source of such issues.
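The following sketch shows how the dot is resolved against the data, and how the same formula describes a different model once a column is added. The toy data frame and the extra column `d` are assumptions made for illustration.

```r
# Toy data frame; column names follow the example in the text.
df <- data.frame(
  a = 1:5,
  b = c(2, 4, 1, 3, 5),
  c = c(0, 1, 0, 1, 1),
  y = c(1.1, 2.3, 2.9, 4.2, 5.0)
)

# terms() expands the dot using the columns of `data`.
labels_before <- attr(terms(y ~ ., data = df), "term.labels")
labels_before

# A column added later (hypothetical `d`) silently becomes a predictor.
df$d <- c(5, 3, 4, 1, 2)
labels_after <- attr(terms(y ~ ., data = df), "term.labels")
labels_after
```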
Useful to know:
With Tidymodels, the full formula syntax described in this chapter is only available when you fit a model directly. In workflows, recipes only accept formulas like the ones shown in this section, that is, formulas without interactions or transformations. Interactions and transformations must be defined using preprocessing functions like `step_interact` or `step_sqrt` (see footnote 15).
29.2 Linear models with interactions
The models in the previous section included only main effects. It is easy to extend the model definition to include interaction terms. Interactions are identified using `:`, e.g. `a:b` represents the interaction of `a` and `b`. Here is an example:
y ~ a + b + c + a:b
This represents the following linear model:
\[
y = c_1 a + c_2 b + c_3 c + c_4 a b + y_0
\]
The formula can be written more concisely using `*`:
y ~ a*b + c
The term `a*b` is expanded to `a + b + a:b`.
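One way to check this expansion is to inspect the term labels that R derives from a formula. This is a small sketch; `terms()` works on the formula alone, so no data are needed.

```r
# term.labels lists the terms R derives from a formula.
labels_star <- attr(terms(y ~ a*b + c), "term.labels")
labels_long <- attr(terms(y ~ a + b + c + a:b), "term.labels")

labels_star
setequal(labels_star, labels_long)  # same set of terms
```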
If you want to include interaction terms between several predictors, the following variations can be used:
y ~ (a + b + c)^2
y ~ (a + b + c)**2
Both expand to `y ~ a + b + c + a:b + a:c + b:c`.
You can extend this expression to include interactions of more than two variables. For example:
y ~ (a + b + c)^3
is equivalent to
y ~ a + b + c + a:b + a:c + b:c + a:b:c
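The power notation can be checked the same way. The sketch below compares the term labels of the `^2`, `**2`, and `^3` variants; the variable names are those used in the text.

```r
# Helper to list the terms of a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

labels_sq <- term_labels(y ~ (a + b + c)^2)
labels_sq_alt <- term_labels(y ~ (a + b + c)**2)
labels_cube <- term_labels(y ~ (a + b + c)^3)

labels_sq                        # main effects and all two-way interactions
identical(labels_sq, labels_sq_alt)
setdiff(labels_cube, labels_sq)  # what ^3 adds on top of ^2
```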
In `recipes`, interaction terms can only be specified using the `step_interact` function. The function uses the same syntax but interprets the formula slightly differently. For example, you can specify `a*b`, which is normally interpreted as `a + b + a:b`; `recipes` will add the interaction `a:b` but not the main effects. In practice, this makes no difference, as the main effects must be included in the recipe’s formula and will therefore be present. See Section 7.7 for more details.
29.3 Linear models with transformations
Formulas can also be used to define variable transformations. For example,
log(y) ~ a + log(x)
defines the following linear model:
\[
\log y = c_1 a + c_2 \log x + y_0
\]
Here, we take the logarithm of both the outcome \(y\) and the predictor \(x\) and train a linear model on the transformed variables.
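As a sketch, the model can be fitted to simulated data. The generating coefficients (0.5, 2, and an intercept of 1) are assumptions chosen for illustration.

```r
# Simulate log(y) = 0.5 * a + 2 * log(x) + 1 plus a little noise.
set.seed(42)
df <- data.frame(a = rnorm(100), x = runif(100, min = 1, max = 10))
df$y <- exp(0.5 * df$a + 2 * log(df$x) + 1 + rnorm(100, sd = 0.05))

# The transformations are applied when the model is fitted.
fit <- lm(log(y) ~ a + log(x), data = df)
coef(fit)  # estimates should be close to the generating values
```

The coefficient for the transformed predictor is reported under the name `log(x)`.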
Not all transformations can be easily expressed. Consider the following linear model:
\[
y = c_1 x^2 + y_0
\]
One could be tempted to write this as `y ~ x^2`. However, inside a formula `^` is the interaction operator introduced in the previous section, so this is interpreted as `y ~ x*x`, which is equivalent to `y ~ x`. The correct way to formulate this model is to use the `I()` function:
y ~ I(x^2)
Whatever is inside the parentheses is evaluated as an ordinary arithmetic expression rather than as formula syntax.
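The difference can be seen in the term labels and in a fit to an exact quadratic. This is a small sketch; the data are constructed as \(y = 3 x^2 + 1\) purely for illustration.

```r
# Without I(), the power is treated as formula syntax and collapses to x.
attr(terms(y ~ x^2), "term.labels")     # "x"
attr(terms(y ~ I(x^2)), "term.labels")  # "I(x^2)"

# Exact quadratic: y = 3 * x^2 + 1
df <- data.frame(x = 1:10)
df$y <- 3 * df$x^2 + 1

fit <- lm(y ~ I(x^2), data = df)
coef(fit)  # intercept near 1, coefficient of I(x^2) near 3
```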
Another case for using the `I()` function is this linear model:
\[
y = c_1 a + c_2 (b + c) + y_0
\]
To express this in a formula use:
y ~ a + I(b + c)
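A short sketch of the effect: with `I()`, the sum `b + c` enters the model as a single predictor with one coefficient. The data below are constructed as \(y = 2a + 3(b + c) + 1\) for illustration only.

```r
# With I(), b + c is one term; without it, b and c are separate terms.
labels_sum <- attr(terms(y ~ a + I(b + c)), "term.labels")
labels_sep <- attr(terms(y ~ a + b + c), "term.labels")
labels_sum
labels_sep

# Constructed data: y = 2 * a + 3 * (b + c) + 1
df <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(1, 0, 0, 1, 1, 0)
)
df$y <- 2 * df$a + 3 * (df$b + df$c) + 1

fit <- lm(y ~ a + I(b + c), data = df)
coef(fit)  # one shared coefficient for b + c
```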
As with interactions, transformations in `recipes` can only be specified using preprocessing functions. See Chapter 7 for more details and examples.
29.4 Miscellaneous
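There are additional operators that you may come across. The `/` operator expresses nesting: `a/b` expands to `a + a:b`. The related `%in%` operator is described in R's formula documentation through the same expansion: `a + b %in% a` is equivalent to `a + a:b`.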
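The expansion of `a/b` can be checked with `terms()`, as in this small sketch:

```r
# a/b is shorthand for a plus b nested within a.
labels_nested <- attr(terms(y ~ a/b), "term.labels")
labels_explicit <- attr(terms(y ~ a + a:b), "term.labels")

labels_nested
identical(labels_nested, labels_explicit)
```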
The `-` operator removes terms. For example, the following formulas are equivalent:
y ~ (a + b + c)^2 - a:b
y ~ a + b + c + a:b + a:c + b:c - a:b
y ~ a + b + c + a:c + b:c
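A quick way to confirm that the three formulas describe the same model is to compare their term labels. This is a sketch; `term_labels` is a hypothetical helper defined here.

```r
# Hypothetical helper that lists the terms of a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

f1 <- y ~ (a + b + c)^2 - a:b
f2 <- y ~ a + b + c + a:b + a:c + b:c - a:b
f3 <- y ~ a + b + c + a:c + b:c

term_labels(f1)
setequal(term_labels(f1), term_labels(f2))
setequal(term_labels(f1), term_labels(f3))
```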
Useful to know:
This approach to defining statistical models was developed by Wilkinson and Rogers in 1973 (G. N. Wilkinson and Rogers 1973). It is also available in Python through the `patsy` package, which is used by `statsmodels` and a few other Python packages.
References
15. You will get the following error message when you add a formula that contains interactions or transformations to a recipe:
! No in-line functions should be used here; use steps to define baking actions.