Chapter 28 Defining models using formulae

28.1 Linear models

In R, statistical models are usually defined using a formula. Here is an example:

y ~ x1 + x2 + x3

This formula describes a linear model of the form: \[ y = c_1 x1 + c_2 x2 + c_3 x3 + y_0 \] The outcome \(y\) is a linear combination of the predictors \(x1\), \(x2\), and \(x3\) with coefficients \(c_1\), \(c_2\), and \(c_3\); \(y_0\) is a constant intercept. Note that the intercept does not appear in the formula: R includes it by default. To make the intercept explicit, add the term 1, which represents the constant intercept \(y_0\) that is estimated as part of the model:

y ~ x1 + x2 + x3 + 1

To exclude the intercept and fit a linear model without intercept, use one of the following options:

y ~ x1 + x2 + x3 - 1
y ~ x1 + x2 + x3 + 0
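The effect of these intercept variants can be checked by fitting them with lm(). The following is a minimal sketch; the data frame df and all values in it are synthetic and made up for illustration only.

```r
# Synthetic data: column names match the formula above, values are invented.
set.seed(1)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
df$y <- 2 + 1.5 * df$x1 - 0.5 * df$x2 + 0.25 * df$x3 + rnorm(20, sd = 0.1)

# Implicit and explicit intercept describe the same model.
with_intercept     <- lm(y ~ x1 + x2 + x3, data = df)
explicit_intercept <- lm(y ~ x1 + x2 + x3 + 1, data = df)

# Both ways of dropping the intercept describe the same model as well.
no_intercept   <- lm(y ~ x1 + x2 + x3 - 1, data = df)
zero_intercept <- lm(y ~ x1 + x2 + x3 + 0, data = df)

print(names(coef(with_intercept)))  # "(Intercept)" "x1" "x2" "x3"
print(names(coef(no_intercept)))    # "x1" "x2" "x3"
```

The coefficient names show whether an intercept was estimated: the "(Intercept)" entry is present in the first two fits and absent in the last two.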

If a variable name contains spaces, surround it with backticks (`) in the formula:

quality ~ `fixed acidity` + `volatile acidity` + chlorides

Here, “fixed acidity” and “volatile acidity” are names of variables or columns in a data frame.
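As a small sketch of how backticked names behave, the following builds a tiny data frame with such column names and fits the formula. The data frame wine and its values are invented for illustration; check.names = FALSE is needed so that data.frame() keeps the spaces in the column names.

```r
# Invented values; only the column names follow the formula above.
wine <- data.frame(
  `fixed acidity`    = c(7.4, 7.8, 8.1, 11.2, 6.9, 7.9),
  `volatile acidity` = c(0.70, 0.88, 0.76, 0.28, 0.66, 0.60),
  chlorides          = c(0.076, 0.098, 0.092, 0.075, 0.081, 0.069),
  quality            = c(5, 4, 5, 7, 6, 5),
  check.names = FALSE  # keep the spaces instead of converting them to dots
)

f <- quality ~ `fixed acidity` + `volatile acidity` + chlorides

# all.vars() returns the plain variable names, without backticks.
print(all.vars(f))

model <- lm(f, data = wine)
print(names(coef(model)))
```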

You will frequently come across the formula y ~ ., for example in the following linear regression model:

model <- lm(y ~ ., data=df)

Here, the . stands for “all columns not otherwise in the formula”. For example, if the tibble (or data frame) contains columns a, b, c, and y, then y ~ . is equivalent to the explicit formula y ~ a + b + c.

While it might be tempting to use the shortcut y ~ ., it is better to be explicit and list all terms to ensure reproducibility. Even if we know exactly what is included in a dataset when we develop the model, data can change over time. For example, if data are downloaded from an external source and columns are added, model results will change. Another source of issues will be columns added to the dataset during exploratory data analysis.
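The expansion of the dot, and the way it silently changes when a column is added, can be inspected with terms(). This is a minimal sketch; the data frame df and the extra column a_squared are hypothetical.

```r
# Hypothetical data frame with columns a, b, c, and the outcome y.
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 1, 4, 3),
  c = c(0, 1, 0, 1),
  y = c(1.1, 1.9, 3.2, 3.8)
)

# terms() with a data argument expands the dot to the remaining columns.
before <- attr(terms(y ~ ., data = df), "term.labels")
print(before)  # "a" "b" "c"

# A column added later, e.g. during exploratory analysis ...
df$a_squared <- df$a^2

# ... becomes part of the model without any change to the formula.
after <- attr(terms(y ~ ., data = df), "term.labels")
print(after)   # "a" "b" "c" "a_squared"
```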

Useful to know:

With tidymodels, you can use the full capabilities of formulas described in this chapter only if you fit a model directly. In workflows, recipes accept only simple formulas like the ones shown in this section. Interactions and transformations must be defined using preprocessing functions like step_interact or step_sqrt.¹⁵

28.2 Linear models with interactions

The models in the previous section included only main effects. It is easy to extend the model definition to include interaction terms. Interactions are specified with the : operator; for example, a:b represents the interaction of a and b. Here is an example:

y ~ a + b + c + a:b

This represents the following linear model: \[ y = c_1 a + c_2 b + c_3 c + c_4 a b + y_0 \] The formula can be written in a more concise way using *:

y ~ a*b + c

The term a*b is expanded to a + b + a:b.
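To see that the long and the concise form describe the same model, both can be fitted and compared. The following sketch uses synthetic data; the data frame df and the coefficients used to generate y are made up for illustration.

```r
# Synthetic data with a genuine a-by-b interaction (invented coefficients).
set.seed(42)
df <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
df$y <- 1 + 2 * df$a - df$b + 0.5 * df$c + 3 * df$a * df$b + rnorm(30, sd = 0.1)

long_form  <- lm(y ~ a + b + c + a:b, data = df)
short_form <- lm(y ~ a * b + c, data = df)

# Main effects come first, the interaction term last.
print(names(coef(short_form)))  # "(Intercept)" "a" "b" "c" "a:b"

# Both formulas yield the same coefficient estimates.
print(all.equal(coef(long_form), coef(short_form)))
```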

If you want to include interaction terms between several predictors, the following variations can be used:

y ~ (a + b + c)^2
y ~ (a + b + c)**2

Both expand to y ~ a + b + c + a:b + a:c + b:c.

You can extend this expression to include interactions of more than two variables. For example:

y ~ (a + b + c)^3

is equivalent to

y ~ a + b + c + a:b + a:c + b:c + a:b:c
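These expansions can be verified without any data by asking terms() for the term labels of a formula. The helper labels_of below is a hypothetical convenience function introduced only for this sketch.

```r
# Hypothetical helper: return the expanded term labels of a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

second_order     <- labels_of(y ~ (a + b + c)^2)
second_order_alt <- labels_of(y ~ (a + b + c)**2)  # ** is parsed as ^
third_order      <- labels_of(y ~ (a + b + c)^3)

print(second_order)  # "a" "b" "c" "a:b" "a:c" "b:c"
print(third_order)   # "a" "b" "c" "a:b" "a:c" "b:c" "a:b:c"
```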

In recipes, interaction terms can only be specified with the step_interact function. The function uses the same syntax but interprets the formula slightly differently. For example, a*b is normally interpreted as a + b + a:b, but step_interact adds only the interaction a:b and not the main effects. In practice, this makes no difference, because the main effects must be included in the recipe’s formula and are therefore already present. See Section 7.7 for more details.

28.3 Linear models with transformations

Formulas can also be used to define variable transformations. For example,

log(y) ~ a + log(x)

defines the following linear model: \[ \log y = c_1 a + c_2 \log x + y_0 \] Here, we take the logarithm of both the outcome \(y\) and the predictor \(x\) and train a linear model using the transformed variables.

Not all transformations can be expressed this easily. Consider the following linear model: \[ y = c_1 x^2 + y_0 \] One could be tempted to write this as y ~ x^2. However, inside a formula ^ is an interaction operator, so this is interpreted as y ~ x*x, which is equivalent to y ~ x. The correct way to formulate this model is to use the I() function:

y ~ I(x^2)

Whatever is inside the parentheses of I() is evaluated as an ordinary arithmetic expression.
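The difference between x^2 and I(x^2) shows up directly in the expanded term labels and in a fitted model. This is a minimal sketch: labels_of is a hypothetical helper, and the data are generated exactly from y = 3 x^2 + 2 so that the expected coefficients are known.

```r
# Hypothetical helper: return the expanded term labels of a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

plain   <- labels_of(y ~ x^2)     # ^ acts as a formula operator
wrapped <- labels_of(y ~ I(x^2))  # ^ acts as arithmetic inside I()
print(plain)    # "x"
print(wrapped)  # "I(x^2)"

# Noise-free data generated from y = 3 * x^2 + 2.
df <- data.frame(x = c(-3, -2, -1, 0, 1, 2, 3))
df$y <- 3 * df$x^2 + 2

fit <- lm(y ~ I(x^2), data = df)
print(round(coef(fit), 6))  # intercept 2, coefficient of I(x^2) 3
```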

Another case for using the I() function is this linear model: \[ y = c_1 a + c_2 (b + c) + y_0 \] To express this in a formula use:

y ~ a + I(b + c)
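One way to see what I(b + c) does is to compare the design matrices that R builds with and without it. The sketch below uses model.matrix() on a small hypothetical data frame df; the values are invented.

```r
# Hypothetical predictors with invented values.
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(0.5, 1.5, 2.5, 3.5),
  c = c(4, 3, 2, 1)
)

# With I(): b and c are summed first and enter as a single predictor.
X_sum <- model.matrix(~ a + I(b + c), data = df)

# Without I(): + separates terms, so b and c get one column each.
X_sep <- model.matrix(~ a + b + c, data = df)

print(colnames(X_sum))  # "(Intercept)" "a" "I(b + c)"
print(colnames(X_sep))  # "(Intercept)" "a" "b" "c"
```

The single column I(b + c) holds the row-wise sum of b and c, which is why the model estimates only one coefficient \(c_2\) for it.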

As with interactions, transformations in recipes can only be specified using preprocessing functions. See Chapter 7 for more details and examples.

28.4 Miscellaneous

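There are additional operators that you may come across. The / operator describes nesting: a/b expands to a + a:b. The related %in% operator indicates that the term on its left is nested within the term on its right, so a + b %in% a is equivalent to a + a:b.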
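The expansion of the / operator can be checked with terms(). This small sketch needs no data; labels_of is a hypothetical helper and a and b are placeholder variable names.

```r
# Hypothetical helper: return the expanded term labels of a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

nested   <- labels_of(y ~ a / b)
explicit <- labels_of(y ~ a + a:b)

print(nested)                       # "a" "a:b"
print(identical(nested, explicit))  # TRUE
```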

The - operator removes terms. For example, the following formulas are equivalent:

y ~ (a + b + c)^2 - a:b
y ~ a + b + c + a:b + a:c + b:c - a:b
y ~ a + b + c + a:c + b:c
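That the three formulas describe the same set of terms can be confirmed by comparing their expanded term labels. As before, labels_of is a hypothetical helper introduced only for this sketch.

```r
# Hypothetical helper: return the expanded term labels of a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

compact  <- labels_of(y ~ (a + b + c)^2 - a:b)
expanded <- labels_of(y ~ a + b + c + a:b + a:c + b:c - a:b)
reduced  <- labels_of(y ~ a + b + c + a:c + b:c)

print(compact)  # "a" "b" "c" "a:c" "b:c"
print(identical(compact, expanded) && identical(expanded, reduced))  # TRUE
```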

Useful to know:

The formula notation that R uses to define statistical models was developed by Wilkinson and Rogers (Wilkinson and Rogers 1973). It is also available in Python through the patsy package, which is used by statsmodels and a few other Python packages.

References

Wilkinson, G. N., and C. E. Rogers. 1973. “Symbolic Description of Factorial Models for Analysis of Variance.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 22 (3): 392–99. https://doi.org/10.2307/2346786.

¹⁵ You will get the following error message when you add a formula that contains interactions or transformations to a recipe: ! No in-line functions should be used here; use steps to define baking actions.