Chapter 3 Data visualization

In this class, we will use the tidyverse package ggplot2. It is based on The Grammar of Graphics (Wilkinson 2005). The basic idea is that you build up a plot by adding layers.

ggplot2 is loaded either with library(ggplot2) or library(tidyverse).

Code
library(tidyverse)
library(patchwork)
library(GGally)

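We also load the package patchwork, which allows us to combine multiple graphs into a single figure, and the package GGally, which provides the pairplot and parallel coordinate plot functions used in Section 3.3.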

Here is an example of a ggplot2 graph:

Code
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg)) +
    geom_point(color="darkgreen") +
    geom_smooth(formula=y ~ x, method="lm")

Figure 3.1: Example of a ggplot2 graph

Step 1: The ggplot command creates a new plot. The first argument is the data frame, and the second argument is the mapping. The mapping connects variables from the data frame to visual properties of the plot. In this case, we are mapping the variable wt to the \(x\)-axis and the variable mpg to the \(y\)-axis.

Step 2: The geom_point command adds the first layer to the plot. In this case, it adds darkgreen points (color argument). The layer is added using the + operator.

Step 3: The geom_smooth command adds a layer with a fitted curve. The formula argument specifies the model formula; with a linear model, y ~ x fits a straight line. The method argument specifies the method for fitting the curve. In this case, we are using the linear model ("lm"). If method is not given, ggplot2 chooses a smoother based on the dataset size: loess for fewer than 1,000 observations and a spline-based generalized additive model (mgcv::gam) otherwise.
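The three steps can also be carried out one at a time, which makes the layering explicit. The following is a minimal sketch rather than part of the original example: the object names p0, p1, and p2 are our own, and inspecting the layers element of the plot object is only meant as an illustration.

```r
library(ggplot2)

# Step 1: data and aesthetic mapping only; printing p0 shows empty axes
p0 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add the point layer
p1 <- p0 + geom_point(color = "darkgreen")

# Step 3: add the fitted line
p2 <- p1 + geom_smooth(formula = y ~ x, method = "lm")

# Each `+` with a geom adds one layer to the plot object
n_layers <- c(length(p0$layers), length(p1$layers), length(p2$layers))
n_layers

p2
```

Because every intermediate result is itself a plot, you can print p0 or p1 to see what each layer contributes.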

Figure 3.1 gives the resulting plot. There are many ways in which we can extend the plot. For example, we can color the points by another data property. Here, we color by the number of cylinders (cyl).

Code
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
    geom_point() +
    geom_smooth(formula=y ~ x, method="lm")

Figure 3.2: Example of a ggplot2 graph with color representing a property

We also added another aesthetic mapping. The variable cyl, the number of cylinders, is mapped to the color aesthetic (see footnote 1). The color aesthetic affects the plot in several ways. It changes the color of the points and the linear regression lines. It also splits the data into groups, so a separate regression line is fitted for each group of points. Finally, a legend that explains what the colors represent is added to the plot (Figure 3.2).
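If you want colored points but a single regression line for all cars, one option is to set the color mapping only in the point layer. This is a sketch of that variation; the name p_single is our own. Aesthetics given in ggplot() are inherited by every layer, while aesthetics given inside a geom apply to that layer only.

```r
library(ggplot2)

p_single <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    # color is mapped for the points only
    geom_point(mapping = aes(color = factor(cyl))) +
    # the smoother does not see the color mapping and fits one line
    geom_smooth(formula = y ~ x, method = "lm")
p_single
```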

The graph in Figure 3.2 uses the column names wt and mpg as labels for the axes and factor(cyl) as the title of the color legend. We can provide better labels using the labs command, which gives the final plot in Figure 3.3.

Code
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
    geom_point() +
    geom_smooth(formula=y ~ x, method="lm") +
    labs(title="Plot of MPG vs Weight",
        x="Weight",
        y="MPG",
        color="Number of Cylinders")

Figure 3.3: Adding labels to the plot

This short example demonstrates the power and flexibility of ggplot2. It is worth taking the time to understand its full potential.

In the following we will look at more examples of graphs that are useful for exploratory data analysis.

3.1 Visualizing a single variable

In exploratory data analysis, we are often interested in the distribution of single variables in a dataset. Commonly used graphs are boxplots, histograms, and density plots.

Code
library(patchwork)
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg)) +
    geom_boxplot() +
    labs(y="MPG", title="Boxplot")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
    geom_histogram(bins=20) +
    labs(x="MPG", title="Histogram")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
    geom_density() +
    labs(x="MPG", title="Density plot")

g1 + g2 + g3 + plot_layout(widths = c(1, 2, 2))

Figure 3.4: Visualizing a single variable with a boxplot, histogram, and density plot

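Figure 3.4 shows the three plots. The first plot is a boxplot (geom_boxplot). The box shows the median and the first and third quartiles. The whiskers extend to the most extreme values that lie within 1.5 times the interquartile range of the box; points beyond the whiskers are drawn individually as outliers. The variable of interest is mapped onto the y axis. This is different from the histogram and density plot, where the variable is mapped onto the x axis (see footnote 2).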

The second plot is a histogram (geom_histogram). It shows the distribution of the data as counts per bin. If neither bins nor binwidth is specified, geom_histogram uses 30 bins and prints a message asking you to pick a better value; here we set bins=20. The default may be fine for your data, but it is often useful to experiment with different bin widths (binwidth) or bin counts (bins) and see how the graph changes. It can also be helpful to change the position of the bins using center or boundary.
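As a sketch of how these arguments interact, the following example fixes the bin width and aligns the bin edges with a boundary. The values 2.5 and 10 are arbitrary choices for illustration, and the names p_hist and bins are our own. The layer_data function returns the data computed for a layer, which lets us inspect the bins.

```r
library(ggplot2)

p_hist <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
    # bins are 2.5 wide and one bin edge is placed at exactly 10
    geom_histogram(binwidth = 2.5, boundary = 10) +
    labs(x = "MPG", title = "Histogram with binwidth = 2.5")
p_hist

# Inspect the computed bins: left edge, right edge, and count
bins <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
bins
```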

The third plot is a density plot (geom_density). It is similar to a histogram but uses a smooth curve instead of bars. As with histograms, the shape of a density plot can be controlled using arguments. The bw argument controls the smoothness of the density plot. By default, the bandwidth is chosen automatically from the data using the rule nrd0 (Silverman 1986); nrd (Scott 1992) is a common alternative. The adjust argument (default 1) is a multiplier applied to this automatically determined bandwidth.
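The effect of adjust is easiest to see side by side. The sketch below first prints the bandwidths that the two rules select for mpg, using bw.nrd0 and bw.nrd from the stats package, and then draws the density with half, the default, and double the automatic bandwidth. The adjust values and the object names are our own choices for illustration.

```r
library(ggplot2)
library(patchwork)

# Bandwidths selected by the two rules for the mpg variable
bws <- c(nrd0 = bw.nrd0(mtcars$mpg), nrd = bw.nrd(mtcars$mpg))
bws

# Helper that draws the density with a given bandwidth multiplier
density_plot <- function(adjust) {
    ggplot(data = mtcars, mapping = aes(x = mpg)) +
        geom_density(adjust = adjust) +
        labs(x = "MPG", title = paste("adjust =", adjust))
}

d1 <- density_plot(0.5)  # less smoothing, more detail and more noise
d2 <- density_plot(1)    # automatic bandwidth
d3 <- density_plot(2)    # more smoothing
d1 + d2 + d3
```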

Useful to know:

In Figure 3.4, we combined multiple plots into a single figure using the patchwork package. We create three plots g1, g2, and g3 and then combine them using the + operator. The plot_layout function is used to control the relative sizes here. You will find more examples of this throughout the book.

Sometimes you will be interested in separating the data by a factor. For example, you may want to compare the distribution of the mpg variable for different numbers of cylinders. Figure 3.5 shows the same three plots as before but now grouped by the number of cylinders.

Code
mtcars <- datasets::mtcars %>% mutate(cyl=as.factor(cyl))
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg, x=cyl, color=cyl)) +
    geom_boxplot() +
    labs(x="Cylinders", y="MPG", title="Boxplot") +
    theme(legend.position="none")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
    geom_histogram(bins=20) +
    labs(x="MPG", title="Stacked histogram") +
    theme(legend.position="none")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
    geom_histogram(bins=20, alpha=0.5, position="identity") +
    labs(x="MPG", title="Histogram") +
    theme(legend.position="none")
g4 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
    geom_density(alpha=0.5) +
    labs(x="MPG", title="Density plot") +
    theme(legend.position="none")

g1 + g2 + g3 + g4 + plot_layout(widths = c(1, 1, 1, 1))

Figure 3.5: Visualizing a single variable with a boxplot, histogram, and density plot grouped by a factor

Boxplot: We map the cyl factor both to the x and the color aesthetic. This creates a separate boxplot for each level of the factor.

Stacked histogram: For the histogram, we map the factor to the fill aesthetic. This creates a stacked histogram.

Histogram: To create a separate histogram for each level of the factor, we need to set the position argument to "identity". The histograms are then drawn on top of each other. So that overlapping bars remain visible, we set the alpha argument to 0.5, which makes the histograms transparent.

Density plot: The density plot is similar to the histogram. We map the factor to the fill aesthetic and set the alpha argument to 0.5 to create overlaid density plots for each level of the factor. The position argument has the same effect as for geom_histogram. The difference is that the default is overlaid density plots. Using position="stack" creates stacked density plots.
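For comparison, here is a sketch of the stacked variant mentioned above. It uses a local copy of the data (cars, our own name) so that the example is self-contained.

```r
library(ggplot2)

# Local copy of mtcars with cyl converted to a factor
cars <- transform(datasets::mtcars, cyl = factor(cyl))

p_stack <- ggplot(data = cars, mapping = aes(x = mpg, fill = cyl)) +
    # stack the per-group density curves instead of overlaying them
    geom_density(position = "stack") +
    labs(x = "MPG", fill = "Cylinders", title = "Stacked density plot")
p_stack
```

Note that each group's density integrates to one on its own, so the stacked heights do not reflect the group sizes.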

By swapping the x and y mappings, the boxplot can be made horizontal. See Figure 3.6.

Code
g <- ggplot(data=mtcars, mapping=aes(x=mpg, y=cyl, color=cyl)) +
    geom_boxplot() +
    labs(x="MPG", y="Cylinders", title="Boxplot") +
    theme(legend.position="none")
g

Figure 3.6: Horizontal boxplot grouped by a factor

3.2 Visualizing two variables

The introductory example showed the relationship between two variables using a scatterplot. Scatterplots are a good choice if the number of data points isn’t too large. As the number of points grows, data points are drawn on top of each other (overplotting). In this case, transparent points reveal the density of the data. The alpha argument controls the transparency: alpha=1, the default, means no transparency; reducing it increases the transparency, and alpha=0 makes the points invisible. A good starting point is 0.5, but always try a variety of alpha values to see which one works best for your data. Figure 3.7 demonstrates the effect of adding transparency.

Code
auto <- ISLR2::Auto %>%
    mutate(cylinders=as.factor(cylinders))

g1_1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_point() +
    labs(x="Weight", y="MPG", title="Scatterplot")
g1_2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_point(alpha=0.5) +
    labs(x="Weight", y="MPG", title="Scatterplot with transparency")

g1_1 + g1_2

Figure 3.7: Using transparency if overplotting occurs

For very large datasets, it is better to use a heatmap or a two-dimensional density plot. Figure 3.8 shows two versions of a heatmap for the ISLR2::Auto dataset.

Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_bin_2d(bins=15) +
    labs(x="Weight", y="MPG", title="Rectangular heatmap")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_hex(bins=15) +
    scale_fill_viridis_c(direction=-1) +
    labs(x="Weight", y="MPG", title="Hexagonal heatmap")

g1 + g2

Figure 3.8: Visualizing two variables with a heatmap

The functions geom_bin_2d and geom_hex create heatmap representations of the distribution. The first uses rectangular patches, the second hexagonal ones. Use the bins argument to change the number of bins in each direction. As with histograms, try different values of bins for your data. There are other arguments to control binning, such as binwidth. In the second example, we use a different color scale (scale_fill_viridis_c). Check the documentation for details.
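As an alternative to bins, you can fix the size of the bins with binwidth, which takes one value per direction. The sketch below assumes the ISLR2 package is installed; the widths of 250 (weight) and 2.5 (mpg) are arbitrary values chosen for illustration, and p_bin is our own name.

```r
library(ggplot2)

auto <- ISLR2::Auto

p_bin <- ggplot(data = auto, mapping = aes(x = weight, y = mpg)) +
    # each rectangle covers 250 units of weight and 2.5 units of mpg
    geom_bin_2d(binwidth = c(250, 2.5)) +
    scale_fill_viridis_c(direction = -1) +
    labs(x = "Weight", y = "MPG", title = "Heatmap with fixed bin widths")
p_bin
```

Fixed bin widths are convenient when the bins should line up with round values on the axes.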

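By default, the color represents the count of data points in a bin. If you want the color to represent a summary of another variable instead, e.g. the average of a variable, you can use the stat_summary_2d function (rectangular bins) or the stat_summary_hex function (hexagonal bins). Figure 3.9 uses stat_summary_hex to color each bin by the average of mpg.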

Code
ggplot(data=auto, mapping=aes(x=weight, y=displacement)) +
    stat_summary_hex(aes(z=mpg), bins=10, fun=mean) +
    scale_fill_viridis_c(direction=-1) +
    geom_point() +
    labs(x="Weight", y="Displacement")

Figure 3.9: Color heatmap by average of mpg
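The rectangular counterpart, stat_summary_2d, is used in the same way. The following sketch assumes the ISLR2 package is installed and mirrors the example above with rectangular bins; the legend title "Mean MPG" and the name p_sum are our own additions.

```r
library(ggplot2)

auto <- ISLR2::Auto

p_sum <- ggplot(data = auto, mapping = aes(x = weight, y = displacement)) +
    # z is the variable that is summarized within each bin by `fun`
    stat_summary_2d(aes(z = mpg), bins = 10, fun = mean) +
    scale_fill_viridis_c(direction = -1) +
    labs(x = "Weight", y = "Displacement", fill = "Mean MPG")
p_sum
```

Any function that reduces a vector to a single number, such as median or max, can be passed as fun.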

Examples of two-dimensional density plots are shown in Figure 3.10. The geom_density_2d function adds the density plot layer.

Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_density_2d() +
    geom_point(size=0.5, color="darkblue") +
    labs(x="Weight", y="MPG", title="Two-dimensional density")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg, color=cylinders)) +
    geom_point(size=0.5) +
    geom_density_2d() +
    scale_colour_brewer(palette="Set1") +
    labs(title="Two-dimensional density by categorical variable",
        x="Weight", y="MPG", color="Number of Cylinders")

g1 + g2

Figure 3.10: Visualizing two variables with a density plot

The left graph shows the density using contour lines. The right graph overlays individual density contours for each of the subsets formed by the categorical variable cylinders. The scale_colour_brewer function selects the colors. The palette argument specifies the color palette. Set1 is a good choice for categorical variables.

Useful to know:

The “Brewer” color scales are based on the work of Cynthia Brewer, who designed color palettes for different use cases. While they were initially developed for coloring maps, the various palettes have become popular options for coloring graphs. You can go to https://colorbrewer2.org/ to explore this more.
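To see which colors a palette contains, or to try a different one, something like the following sketch can help. It assumes the RColorBrewer package is available (it is normally installed together with ggplot2 as a dependency of the scales package). The choice of "Dark2" as an alternative qualitative palette and the object names are ours.

```r
library(ggplot2)

# Hex codes of the first five colors of the Set1 palette
set1 <- RColorBrewer::brewer.pal(n = 5, name = "Set1")
set1

# Same kind of plot with a different qualitative Brewer palette
p_dark <- ggplot(data = mtcars,
                 mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point() +
    scale_colour_brewer(palette = "Dark2") +
    labs(x = "Weight", y = "MPG", color = "Number of Cylinders")
p_dark
```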

We can add filled contour lines using the function geom_density_2d_filled. See Figure 3.11.

Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_density_2d_filled(alpha=0.5) +
    geom_point(size=0.5) +
    labs(title="Two-dimensional density (filled)", x="Weight", y="MPG") +
    theme(legend.position="none")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
    geom_density_2d_filled() +
    geom_point(size=0.5) +
    scale_fill_brewer() +
    labs(title="Alternative color scheme", x="Weight", y="MPG") +
    theme(legend.position="none")

g1 + g2

Figure 3.11: Filled two-dimensional density

3.3 Visualizing multiple variables

One option to visualize multiple variables in a graph is a pairplot. Figure 3.12 uses the ggpairs function from the GGally package.

Code
pair_auto <- ISLR2::Auto %>%
    mutate(
        cylinders=as.factor(cylinders),
        origin=as.factor(origin),
    ) %>%
    select(-name)
ggpairs(pair_auto,
    lower=list(combo=wrap("facethist", binwidth=0.5)))

Figure 3.12: Pair plot

A pairplot shows visualizations of pairs of variables in a compact presentation. By default, ggpairs uses density plots and bar charts along the diagonal to show the distribution of continuous and categorical variables. The upper and lower triangle visualizations depend on the type of the two variables. If both variables are continuous, the upper triangle shows their correlation as a value and the lower triangle a scatterplot. If one variable is continuous and the other is categorical, the upper triangle uses boxplots and the lower triangle histograms faceted by the categorical variable (facethist, here with binwidth=0.5). If both variables are categorical, the lower triangle shows a bar chart and the upper triangle a type of two-dimensional bar chart.
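With many variables the pairplot becomes crowded, so it can help to restrict it to a few columns and to color by a group. The following sketch uses mtcars instead of the Auto data to stay self-contained; the selected columns, the bins value, and the object names are our own choices for illustration.

```r
library(ggplot2)
library(GGally)

# Local copy of mtcars with cyl converted to a factor
cars <- transform(datasets::mtcars, cyl = factor(cyl))

p_pairs <- ggpairs(cars,
    columns = c("mpg", "wt", "cyl"),   # only show these variables
    mapping = aes(color = cyl),        # color all panels by cyl
    lower = list(combo = wrap("facethist", bins = 10)))
p_pairs
```

The wrap function passes extra arguments, here bins, to the panel function that draws the continuous/categorical combinations.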

Figure 3.13 shows an alternative visualization of multiple variables. The ggparcoord function from the GGally package creates a parallel coordinate plot. It shows the values of each data point as a line.

Code
g1 <- pair_auto %>%
    ggparcoord(columns=1:7, groupColumn=8)
g2 <- pair_auto %>%
    ggparcoord(columns=c(2:5, 7, 6, 1), groupColumn=8, alpha=0.5, splineFactor=10)
g1 + g2

Figure 3.13: Parallel coordinate plot. Lines connect the values of each data point and are colored by the origin of the car.

Parallel coordinate plots can be hard to read, and it is worth experimenting with different orderings of the variables and other settings. Here, we used alpha (which R matches to the alphaLines argument of ggparcoord) to make the lines transparent, which helps for larger datasets. The splineFactor argument controls the smoothness of the lines. Without smoothing, the variable values would be connected by straight lines. Smoothing separates the lines and makes it easier to see the distribution of the data. It also adds information on how the coordinates to the left (mpg) and the right (displacement) are connected. The groupColumn argument is used to color the lines by the origin of the car.
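As a small sketch of such experiments, the example below draws the same four variables in two different orders and with two different scalings. It uses mtcars to stay self-contained; the column selection, the "uniminmax" scaling (which rescales each variable to the range 0 to 1), and the object names are our own choices.

```r
library(ggplot2)
library(GGally)
library(patchwork)

# Local copy of mtcars with cyl converted to a factor
cars <- transform(datasets::mtcars, cyl = factor(cyl))

# Columns 1, 3, 4, 6 are mpg, disp, hp, wt; default scaling standardizes each variable
pc1 <- ggparcoord(cars, columns = c(1, 3, 4, 6),
    groupColumn = "cyl", alphaLines = 0.5)

# Same variables in a different order, rescaled to the range [0, 1]
pc2 <- ggparcoord(cars, columns = c(3, 1, 6, 4),
    groupColumn = "cyl", scale = "uniminmax", alphaLines = 0.5)

pc1 + pc2
```

Placing strongly related variables next to each other often makes the line patterns easier to follow.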

Parallel coordinate plots also benefit greatly from interactivity. You can use the plotly package to create interactive parallel coordinate plots (see https://plotly.com/r/parallel-coordinates-plot/ for examples).

3.4 Saving plots to file

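You can save plots to file using the ggsave function. The following example saves the combined plot g1 + g2 to a png file. Note that g1 and g2 refer to their most recent assignment, which at this point in the chapter is the pair of parallel coordinate plots from Figure 3.13.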

Code
ggsave(filename="example.png", plot=g1 + g2,
    width=8, height=4, units="in", dpi=300)

Here is the saved figure:

Code
knitr::include_graphics("example.png")

Figure 3.14: Saved figure
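ggsave picks the output format from the file extension. The sketch below writes the same plot as a png and a pdf file into a temporary directory; the file names and the plot size are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point()

# Write into a temporary directory so that no project files are touched
png_file <- file.path(tempdir(), "scatter.png")
pdf_file <- file.path(tempdir(), "scatter.pdf")

# Raster output: dpi determines the pixel dimensions
ggsave(filename = png_file, plot = p,
    width = 6, height = 4, units = "in", dpi = 300)
# Vector output: scales without loss of quality
ggsave(filename = pdf_file, plot = p,
    width = 6, height = 4, units = "in")

saved <- file.exists(c(png_file, pdf_file))
saved
```

Passing the plot explicitly through the plot argument avoids relying on the last plot that happened to be displayed.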

3.5 autoplot and autolayer functions

Some R packages provide autoplot functions that create ggplot2 graphs. If available, these functions are useful for quickly visualizing special data or the result of calculations. The functions return a ggplot2 graph that can be further customized using the methods shown in this chapter. Packages that implement the autoplot function often also provide an autolayer function. This function adds a layer with a specialized visualization to an existing ggplot2 graph.

In this book, we use autoplot functions to visualize the results of model tuning (see Chapter 14) and ROC curves (see Section 10.3).

For example, the forecast package provides autoplot and autolayer functions for time series objects. Figure 3.15 shows how the autoplot function selects an appropriate axis scale for time series data.

Code
library(forecast)
autoplot(AirPassengers) +
    autolayer(seasadj(decompose(AirPassengers, "multiplicative"))) +
    theme(legend.position="none")

Figure 3.15: Example of an autoplot: air passenger counts (black) with seasonal adjustment (red)
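To give an idea of how packages provide such functions, here is a minimal sketch of an autoplot method for a made-up result class. The class dose_response, its constructor, and the data are hypothetical and exist only for this illustration; real packages additionally register the method in their namespace.

```r
library(ggplot2)

# Hypothetical constructor for a small result object
new_dose_response <- function(dose, response) {
    structure(
        list(data = data.frame(dose = dose, response = response)),
        class = "dose_response"
    )
}

# autoplot is a generic function in ggplot2; this adds a method for our class
autoplot.dose_response <- function(object, ...) {
    ggplot(data = object$data, mapping = aes(x = dose, y = response)) +
        geom_line() +
        geom_point()
}

dr <- new_dose_response(dose = c(1, 2, 4, 8), response = c(0.1, 0.3, 0.6, 0.8))

# The result is a regular ggplot2 graph that can be customized further
p_auto <- autoplot(dr) +
    labs(x = "Dose", y = "Response", title = "autoplot for a custom class")
p_auto
```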


References

Scott, David W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. 1st edition. New York: John Wiley & Sons.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman and Hall.
Wilkinson, Leland. 2005. The Grammar of Graphics. Statistics and Computing. New York: Springer-Verlag. https://doi.org/10.1007/0-387-28695-0.

  1. Note: we convert the cyl variable to a factor. It would be better to do this at the data preprocessing stage.

  2. You could also map the variable onto the x aesthetic. This would give you a horizontal boxplot.