Chapter 3 Data visualization
In this class, we will use the tidyverse package ggplot2. It is based on The Grammar of Graphics by Wilkinson (Wilkinson 2005). The basic idea is that you can build up a plot by adding layers. ggplot2 is loaded either with library(ggplot2) or library(tidyverse).
We also load the package patchwork, which allows us to combine multiple graphs into a single figure.
Here is an example of a ggplot2 graph:
Code
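The code summary at the end of this chapter lists the following code for this first figure. It is repeated here so that the three steps described next can be followed line by line. The name p is an addition for illustration only; the comments are ours.

```r
library(ggplot2)

# Step 1: create the plot; map wt to the x-axis and mpg to the y-axis
p <- ggplot(data=mtcars, mapping=aes(x=wt, y=mpg)) +
  # Step 2: add a layer of dark green points
  geom_point(color="darkgreen") +
  # Step 3: add a layer with a linear regression line
  geom_smooth(formula=y ~ x, method="lm")
p
```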
Step 1: The ggplot command creates a new plot. The first argument is the data frame, and the second argument is the mapping. It maps the variables from the data frame to the visual properties of the plot. In this case, we are mapping the variable wt to the \(x\)-axis and the variable mpg to the \(y\)-axis.
Step 2: The geom_point command adds the first layer to the plot. In this case, it adds dark green points (color argument). The layer is added using the + operator.
Step 3: The geom_smooth command adds a layer with a fitted curve. The formula argument specifies the formula for the curve; here it is y ~ x. The method argument specifies the method for fitting the curve. In this case, we are using the linear model ("lm"), so the result is a linear regression line. If no method is given, geom_smooth chooses one based on the dataset size: a loess curve for fewer than 1,000 observations and a spline-based smoother (mgcv::gam) for larger datasets.
Figure 3.1 gives the resulting plot. There are many ways in which we can extend the plot. For example, we can color the points by another data property. Here, we color by the number of cylinders (cyl).
Code
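For reference, the code summary at the end of this chapter gives the following code for the colored version. The only change from the first example is the additional color mapping; the name p2 is added here for illustration.

```r
library(ggplot2)

# Map the number of cylinders, converted to a factor, to the color aesthetic
p2 <- ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(formula=y ~ x, method="lm")
p2
```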
We also added another aesthetic mapping. The variable cyl, the number of cylinders, is mapped to the color aesthetic. The color aesthetic affects the plot in several ways. It changes the color of the points and the linear regression lines. It also creates individual regression lines for each group of points. Finally, a legend that explains what the colors represent is added to the plot (Figure 3.2).
The graph in Figure 3.2 uses the column names wt and mpg as labels for the axes and factor(cyl) for the color information in the legend. We can provide better labels using the labs command, which gives the final plot in Figure 3.3.
Code
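The labeled version, as listed in the code summary at the end of this chapter, adds a single labs call. The name p3 is added here for illustration.

```r
library(ggplot2)

p3 <- ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(formula=y ~ x, method="lm") +
  # Replace the default labels (column names) with readable text
  labs(title="Plot of MPG vs Weight",
       x="Weight",
       y="MPG",
       color="Number of Cylinders")
p3
```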
This short example should demonstrate the power and flexibility of ggplot2. It is useful to get an understanding of the full potential of ggplot2.
Todo:
- Go to the ggplot2 website at https://ggplot2.tidyverse.org/ and look at the Reference section.
- Visit the R graph gallery at https://r-graph-gallery.com/ggplot2-package.html to get an overview of the different types of plots that can be created with ggplot2.
In the following we will look at more examples of graphs that are useful for exploratory data analysis.
3.1 Visualizing a single variable
In exploratory data analysis, we are often interested in the distribution of single variables in a dataset. Commonly used graphs are boxplots, histograms, and density plots.
Code
library(patchwork)
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg)) +
  geom_boxplot() +
  labs(y="MPG", title="Boxplot")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_histogram(bins=20) +
  labs(x="MPG", title="Histogram")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_density() +
  labs(x="MPG", title="Density plot")
g1 + g2 + g3 + plot_layout(widths = c(1, 2, 2))
Figure 3.4 shows the three plots. The first plot is a boxplot (geom_boxplot). The box shows the median and the first and third quartiles. The whiskers extend to the most extreme values that lie within 1.5 times the interquartile range of the box; points beyond the whiskers are drawn individually as outliers. The variable of interest is mapped onto the y axis. This is different from the histogram and density plot, where the variable is mapped onto the x axis.
The second plot is a histogram (geom_histogram). It shows the distribution of the data. If neither bins nor binwidth is specified, geom_histogram uses 30 bins and prints a message asking you to pick a better value. Thirty bins may be fine for your data; here we set bins=20. It is often useful to experiment with different bin sizes (binwidth) or counts (bins) and see how the graph changes. It can be helpful to also change the position of the bins using center or boundary.
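As a minimal sketch of these arguments, the following example sets the bin width directly and aligns the bin edges with boundary. The values 2.5 and 10 are arbitrary choices for the mtcars data, not recommendations; layer_data is used only to inspect the bins that ggplot2 computed.

```r
library(ggplot2)

# Bins are 2.5 MPG wide and their edges are aligned to multiples of 2.5 from 10
h <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_histogram(binwidth=2.5, boundary=10) +
  labs(x="MPG", title="Histogram with binwidth=2.5")

# Inspect the computed bins: lower edge, upper edge, and count per bin
bins <- layer_data(h)
bins[, c("xmin", "xmax", "count")]
h
```

Changing boundary shifts all bin edges without changing their width, which can noticeably alter the shape of a histogram for small datasets.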
The third plot is a density plot (geom_density). It is similar to a histogram but uses a smooth curve instead of bars. Similar to histograms, the shape of density plots can be controlled using arguments. The bw argument controls the smoothness of the density plot. By default, a bandwidth is chosen automatically from the data using one of several rules. nrd0 (Silverman 1986), which is the default, or nrd (Scott 1992) are good choices. The adjust argument (default 1) is a multiplier that scales this automatically determined bandwidth.
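The effect of adjust can be seen by overlaying several density layers. The sketch below is an illustration with arbitrarily chosen values (0.5, 1, and 2) and colors; bw.nrd0 is the base R function that implements the nrd0 rule and is called here only to show the bandwidth that gets scaled.

```r
library(ggplot2)

# Bandwidth that the nrd0 rule selects for the mpg variable
bw_default <- bw.nrd0(mtcars$mpg)

d <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_density(adjust=0.5, color="darkgreen") +  # half the bandwidth: wigglier
  geom_density(adjust=1) +                       # default bandwidth
  geom_density(adjust=2, color="darkblue") +     # twice the bandwidth: smoother
  labs(x="MPG", title="Effect of the adjust argument")
d
```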
Useful to know:
In Figure 3.4, we combined multiple plots into a single figure using the patchwork package. We create three plots g1, g2, and g3 and then combine them using the + operator. The plot_layout function is used to control the relative sizes here. You will find more examples of this throughout the book.
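Besides +, patchwork also provides the operators | (side by side) and / (stacked) for more explicit layouts. The following sketch rebuilds the three plots and arranges them in two rows; the layout and the overall title are our own choices for illustration.

```r
library(ggplot2)
library(patchwork)

g1 <- ggplot(data=mtcars, mapping=aes(y=mpg)) +
  geom_boxplot() +
  labs(y="MPG", title="Boxplot")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_histogram(bins=20) +
  labs(x="MPG", title="Histogram")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_density() +
  labs(x="MPG", title="Density plot")

# Top row: boxplot next to histogram; bottom row: density plot
combined <- (g1 | g2) / g3 +
  plot_annotation(title="Distribution of MPG")
combined
```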
Sometimes you will be interested in separating the data by a factor. For example, you may want to compare the distribution of the mpg variable for different numbers of cylinders. Figure 3.5 shows the same three plots as before but now grouped by the number of cylinders.
Code
mtcars <- datasets::mtcars %>% mutate(cyl=as.factor(cyl))
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg, x=cyl, color=cyl)) +
  geom_boxplot() +
  labs(x="Cylinders", y="MPG", title="Boxplot") +
  theme(legend.position="none")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_histogram(bins=20) +
  labs(x="MPG", title="Stacked histogram") +
  theme(legend.position="none")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_histogram(bins=20, alpha=0.5, position="identity") +
  labs(x="MPG", title="Histogram") +
  theme(legend.position="none")
g4 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_density(alpha=0.5) +
  labs(x="MPG", title="Density plot") +
  theme(legend.position="none")
g1 + g2 + g3 + g4 + plot_layout(widths = c(1, 1, 1, 1))
Boxplot: We map the cyl factor both to the x and the color aesthetic. This creates a separate boxplot for each level of the factor.
Stacked histogram: For the histogram, we map the factor to the fill aesthetic. This creates a stacked histogram.
Histogram: To create an overlaid histogram for each level of the factor, we need to set the position argument to "identity". Each histogram then starts at zero instead of being stacked on the others. Because the histograms are now drawn on top of each other, we set the alpha argument to 0.5. This makes the histograms transparent so that overlapping bars remain visible.
Density plot: The density plot is similar to the histogram. We map the factor to the fill aesthetic and set the alpha argument to 0.5 to create overlaid density plots for each level of the factor. The position argument has the same effect as for geom_histogram. The difference is that overlaid density plots are the default. Using position="stack" creates stacked density plots.
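A stacked density plot is not shown in the figure, so here is a minimal sketch of it. The data frame name cars_f and the title are our own; base R transform is used instead of mutate only to keep the example free of additional packages.

```r
library(ggplot2)

# Convert cyl to a factor so that it can be used as a grouping variable
cars_f <- transform(datasets::mtcars, cyl=as.factor(cyl))

# position="stack" places the density curves on top of each other
s <- ggplot(data=cars_f, mapping=aes(x=mpg, fill=cyl)) +
  geom_density(position="stack") +
  labs(x="MPG", fill="Cylinders", title="Stacked density plot")
s
```

Note that in a stacked density plot only the lowest group can be read directly from the axis; the other groups have to be judged by the height of their band.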
By changing the x and y mapping, the boxplot can be made horizontal. See Figure 3.6.
Code
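The code summary at the end of this chapter lists the following code for the horizontal boxplot. The conversion of cyl to a factor is repeated here so that the example runs on its own.

```r
library(ggplot2)
library(dplyr)

mtcars <- datasets::mtcars %>% mutate(cyl=as.factor(cyl))

# Swapping the x and y mappings turns the boxplots on their side
g <- ggplot(data=mtcars, mapping=aes(x=mpg, y=cyl, color=cyl)) +
  geom_boxplot() +
  labs(x="MPG", y="Cylinders", title="Boxplot") +
  theme(legend.position="none")
g
```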
3.2 Visualizing two variables
The introductory example showed the relationship between two variables using a scatterplot. Scatterplots are a good choice if the number of data points isn’t too large. If the number of points gets larger, data points will be shown on top of each other. In this case, using transparent points will reveal the density of the data. The argument alpha changes the transparency. alpha=1 is the default and means no transparency. Reducing it increases the transparency; 0 makes the points invisible. A good starting point is 0.5. Always try a variety of alpha values to see which one works best for your data. Figure 3.7 demonstrates the effect of adding transparency.
Code
auto <- ISLR2::Auto %>%
  mutate(cylinders=as.factor(cylinders))
g1_1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_point() +
  labs(x="Weight", y="MPG", title="Scatterplot")
g1_2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_point(alpha=0.5) +
  labs(x="Weight", y="MPG", title="Scatterplot with transparency")
g1_1 + g1_2
For very large datasets, it is better to use a heatmap or a two-dimensional density plot. Figure 3.8 shows the two versions of the heatmap for the ISLR2::Auto dataset.
Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_bin_2d(bins=15) +
  labs(x="Weight", y="MPG", title="Rectangular heatmap")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_hex(bins=15) +
  scale_fill_viridis_c(direction=-1) +
  labs(x="Weight", y="MPG", title="Hexagonal heatmap")
g1 + g2
The functions geom_bin_2d and geom_hex create heatmap representations of the distribution. The first uses rectangular, the second hexagonal patches. Use the bins argument to change the number of bins in each direction. Similar to histograms, try different values of bins for your data. There are other arguments to control binning. In the second example, we use a different colormap (scale_fill_viridis_c). Check the documentation for details.
By default, the color represents the count of data points in a bin. If you want to use a value, e.g. the average of a variable, you can use the stat_summary_2d function or its hexagonal counterpart stat_summary_hex. See Figure 3.9 for an example.
Code
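The code summary at the end of this chapter lists the following code for this figure. It uses the hexagonal variant stat_summary_hex and colors each bin by the mean mpg of the cars that fall into it. The example assumes that the ISLR2 and hexbin packages are installed; the name p_mean is added for illustration.

```r
library(ggplot2)

auto <- ISLR2::Auto

p_mean <- ggplot(data=auto, mapping=aes(x=weight, y=displacement)) +
  # z is the variable to summarize; fun is applied to the z values in each bin
  stat_summary_hex(aes(z=mpg), bins=10, fun=mean) +
  scale_fill_viridis_c(direction=-1) +
  geom_point() +
  labs(x="Weight", y="Displacement")
p_mean
```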
Examples of two-dimensional density plots are shown in Figure 3.10. The geom_density_2d function adds the density plot layer.
Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d() +
  geom_point(size=0.5, color="darkblue") +
  labs(x="Weight", y="MPG", title="Two-dimensional density")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg, color=cylinders)) +
  geom_point(size=0.5) +
  geom_density_2d() +
  scale_colour_brewer(palette="Set1") +
  labs(title="Two-dimensional density by categorical variable",
       x="Weight", y="MPG", color="Number of Cylinders")
g1 + g2
The left graph shows the density using contour lines. The right graph overlays individual density contours for each of the subsets formed by the categorical variable cylinders. The scale_colour_brewer function selects the colors. The palette argument specifies the color palette. Set1 is a good choice for categorical variables.
Useful to know:
The “Brewer” color scales are based on the work of Cynthia Brewer, who designed color palettes for different use cases. While they were initially developed for coloring maps, the various palettes have become popular options for coloring graphs. You can go to https://colorbrewer2.org/ to explore this more.
We can add filled contour lines using the function geom_density_2d_filled. See Figure 3.11.
Code
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d_filled(alpha=0.5) +
  geom_point(size=0.5) +
  labs(title="Two-dimensional density (filled)", x="Weight", y="MPG") +
  theme(legend.position="none")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d_filled() + # optional arguments: contour_var="ndensity", bins=5
  geom_point(size=0.5) +
  scale_fill_brewer() +
  labs(title="Alternative color scheme", x="Weight", y="MPG") +
  theme(legend.position="none")
g1 + g2
3.3 Visualizing multiple variables
One option to visualize multiple variables in a graph is a pairplot. Figure 3.12 uses the ggpairs function from the GGally package.
Code
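The code summary at the end of this chapter lists the following code for the pairplot. It converts cylinders and origin to factors, drops the name column, and sets the bin width of the faceted histograms in the lower triangle. The example assumes that the ISLR2 and GGally packages are installed; the name pairs_plot is added for illustration.

```r
library(dplyr)
library(GGally)  # also attaches ggplot2

pair_auto <- ISLR2::Auto %>%
  mutate(
    cylinders=as.factor(cylinders),
    origin=as.factor(origin)
  ) %>%
  select(-name)

# "combo" refers to panels that pair a continuous with a categorical variable
pairs_plot <- ggpairs(pair_auto,
                      lower=list(combo=wrap("facethist", binwidth=0.5)))
pairs_plot
```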
A pairplot shows visualizations of pairs of variables in a compact presentation. By default, ggpairs uses density plots and bar charts along the diagonal to show the distribution of continuous and categorical variables. The upper and lower triangle visualizations depend on the type of the two variables. If both variables are continuous, the upper triangle shows their correlation as a value and the lower triangle a scatterplot. If one variable is continuous and the other is categorical, the upper triangle uses boxplots and the lower triangle faceted histograms. If both variables are categorical, the lower triangle shows a bar chart and the upper triangle a type of two-dimensional bar chart.
Figure 3.13 shows an alternative visualization of multiple variables. The ggparcoord function from the GGally package creates a parallel coordinate plot. It shows the values of each data point as a line.
Code
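The code summary at the end of this chapter lists the following code for the two parallel coordinate plots. The preparation of pair_auto is repeated so that the example runs on its own; it assumes that the ISLR2, GGally, and patchwork packages are installed.

```r
library(dplyr)
library(GGally)  # also attaches ggplot2
library(patchwork)

pair_auto <- ISLR2::Auto %>%
  mutate(cylinders=as.factor(cylinders), origin=as.factor(origin)) %>%
  select(-name)

# Default settings: columns 1 to 7 in their original order, colored by origin
g1 <- pair_auto %>%
  ggparcoord(columns=1:7, groupColumn=8)

# Reordered columns, transparent lines, and spline smoothing
# (to our understanding, alpha is matched to ggparcoord's alphaLines argument)
g2 <- pair_auto %>%
  ggparcoord(columns=c(2:5, 7, 6, 1), groupColumn=8, alpha=0.5, splineFactor=10)
g1 + g2
```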
Parallel coordinate plots can be hard to read, and it is worth experimenting with different orderings of the variables and other settings. Here, we used alpha to make the lines transparent, which helps for larger datasets. The splineFactor argument controls the smoothness of the lines. Without smoothing, the variable values would be connected by straight lines. Smoothing separates the lines and makes it easier to see the distribution of the data. It also adds information on how the coordinates to the left (mpg) and the right (displacement) are connected. The groupColumn argument is used to color the lines by the origin of the car.
Parallel coordinate plots also benefit greatly from interactivity. You can use the plotly package to create interactive parallel coordinate plots (see https://plotly.com/r/parallel-coordinates-plot/ for examples).
3.4 Saving plots to file
You can save plots to file using the ggsave function. The following example saves the combined plot g1 + g2 (the parallel coordinate plots from Figure 3.13) to a PNG file.
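As a self-contained sketch, the example below saves a simple scatterplot with the same ggsave arguments that the code summary at the end of this chapter uses (width, height, units, and dpi). The plot p and the temporary output path out_file are our own choices for illustration; in your own work you would pass a file name such as "example.png".

```r
library(ggplot2)

p <- ggplot(data=mtcars, mapping=aes(x=wt, y=mpg)) +
  geom_point()

# Write to a temporary directory so that the example leaves no files behind
out_file <- file.path(tempdir(), "example.png")

# The file extension determines the output format; dpi sets the resolution
ggsave(filename=out_file, plot=p,
       width=8, height=4, units="in", dpi=300)
file.exists(out_file)
```

If the plot argument is omitted, ggsave saves the last plot that was displayed.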
Here is the saved figure:
3.5 autoplot and autolayer functions
Some R packages provide autoplot functions that create ggplot2 graphs. If available, these functions are useful for quickly visualizing special data or the result of calculations. The functions return a ggplot2 graph that can be further customized using the methods shown in this chapter. Packages that implement the autoplot function often also provide an autolayer function. This function adds a layer with a specialized visualization to an existing ggplot2 graph.
In this book, we use autoplot functions to visualize the results of model tuning (see Chapter 14) and ROC curves (see Section 10.3).
For example, the forecast package provides autoplot and autolayer functions for time series objects. Figure 3.15 shows how the autoplot function selects an appropriate axis scale for time series data.
Code
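The code summary at the end of this chapter lists the following code for this figure. autoplot draws the AirPassengers time series, and autolayer adds the seasonally adjusted series on top. The example assumes that the forecast package is installed; the name ts_plot is added for illustration.

```r
library(ggplot2)
library(forecast)

ts_plot <- autoplot(AirPassengers) +
  # Remove the seasonal component of a multiplicative decomposition
  autolayer(seasadj(decompose(AirPassengers, "multiplicative"))) +
  theme(legend.position="none")
ts_plot
```

Because the result is a regular ggplot2 graph, further layers, labels, and themes can be added with + as shown earlier in this chapter.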
Further information:
- The ggplot2 cheatsheet is a two-page summary of all the main features of ggplot2.
- For more details about ggplot2, see the main ggplot2 website at https://ggplot2.tidyverse.org/.
- The R graph gallery provides an overview of the different types of plots that can be created with ggplot2.
- ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham et al. is the definitive guide to ggplot2.
- The R Graphics Cookbook by Winston Chang is a great resource for learning how to create different types of plots in R.
- https://ggobi.github.io/ggally/ is the website of the GGally package. It provides a number of useful functions for creating more complex plots with ggplot2.
Code
The code of this chapter is summarized here.
Code
# Setup: chunk options and packages
knitr::opts_chunk$set(echo=TRUE, cache=TRUE, autodep=TRUE, fig.align="center")
library(tidyverse)
library(patchwork)
library(GGally)
# Figure 3.1: scatterplot with a linear fit
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg)) +
  geom_point(color="darkgreen") +
  geom_smooth(formula=y ~ x, method="lm")
# Figure 3.2: points and fits colored by number of cylinders
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(formula=y ~ x, method="lm")
# Figure 3.3: same plot with readable labels
ggplot(data=mtcars, mapping=aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point() +
  geom_smooth(formula=y ~ x, method="lm") +
  labs(title="Plot of MPG vs Weight",
       x="Weight",
       y="MPG",
       color="Number of Cylinders")
# Figure 3.4: boxplot, histogram, and density plot of a single variable
library(patchwork)
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg)) +
  geom_boxplot() +
  labs(y="MPG", title="Boxplot")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_histogram(bins=20) +
  labs(x="MPG", title="Histogram")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg)) +
  geom_density() +
  labs(x="MPG", title="Density plot")
g1 + g2 + g3 + plot_layout(widths = c(1, 2, 2))
# Figure 3.5: the same plots grouped by number of cylinders
mtcars <- datasets::mtcars %>% mutate(cyl=as.factor(cyl))
g1 <- ggplot(data=mtcars, mapping=aes(y=mpg, x=cyl, color=cyl)) +
  geom_boxplot() +
  labs(x="Cylinders", y="MPG", title="Boxplot") +
  theme(legend.position="none")
g2 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_histogram(bins=20) +
  labs(x="MPG", title="Stacked histogram") +
  theme(legend.position="none")
g3 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_histogram(bins=20, alpha=0.5, position="identity") +
  labs(x="MPG", title="Histogram") +
  theme(legend.position="none")
g4 <- ggplot(data=mtcars, mapping=aes(x=mpg, fill=cyl)) +
  geom_density(alpha=0.5) +
  labs(x="MPG", title="Density plot") +
  theme(legend.position="none")
g1 + g2 + g3 + g4 + plot_layout(widths = c(1, 1, 1, 1))
# Figure 3.6: horizontal boxplot
g <- ggplot(data=mtcars, mapping=aes(x=mpg, y=cyl, color=cyl)) +
  geom_boxplot() +
  labs(x="MPG", y="Cylinders", title="Boxplot") +
  theme(legend.position="none")
g
# Figure 3.7: scatterplot without and with transparency
auto <- ISLR2::Auto %>%
  mutate(cylinders=as.factor(cylinders))
g1_1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_point() +
  labs(x="Weight", y="MPG", title="Scatterplot")
g1_2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_point(alpha=0.5) +
  labs(x="Weight", y="MPG", title="Scatterplot with transparency")
g1_1 + g1_2
# Figure 3.8: rectangular and hexagonal heatmaps
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_bin_2d(bins=15) +
  labs(x="Weight", y="MPG", title="Rectangular heatmap")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_hex(bins=15) +
  scale_fill_viridis_c(direction=-1) +
  labs(x="Weight", y="MPG", title="Hexagonal heatmap")
g1 + g2
# Figure 3.9: hexagonal bins colored by the mean of mpg
ggplot(data=auto, mapping=aes(x=weight, y=displacement)) +
  stat_summary_hex(aes(z=mpg), bins=10, fun=mean) +
  scale_fill_viridis_c(direction=-1) +
  geom_point() +
  labs(x="Weight", y="Displacement")
# Figure 3.10: two-dimensional density contours
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d() +
  geom_point(size=0.5, color="darkblue") +
  labs(x="Weight", y="MPG", title="Two-dimensional density")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg, color=cylinders)) +
  geom_point(size=0.5) +
  geom_density_2d() +
  scale_colour_brewer(palette="Set1") +
  labs(title="Two-dimensional density by categorical variable",
       x="Weight", y="MPG", color="Number of Cylinders")
g1 + g2
# Figure 3.11: filled two-dimensional density contours
g1 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d_filled(alpha=0.5) +
  geom_point(size=0.5) +
  labs(title="Two-dimensional density (filled)", x="Weight", y="MPG") +
  theme(legend.position="none")
g2 <- ggplot(data=auto, mapping=aes(x=weight, y=mpg)) +
  geom_density_2d_filled() + # optional arguments: contour_var="ndensity", bins=5
  geom_point(size=0.5) +
  scale_fill_brewer() +
  labs(title="Alternative color scheme", x="Weight", y="MPG") +
  theme(legend.position="none")
g1 + g2
# Figure 3.12: pairplot
pair_auto <- ISLR2::Auto %>%
  mutate(
    cylinders=as.factor(cylinders),
    origin=as.factor(origin)
  ) %>%
  select(-name)
ggpairs(pair_auto,
        lower=list(combo=wrap("facethist", binwidth=0.5)))
# Figure 3.13: parallel coordinate plots
g1 <- pair_auto %>%
  ggparcoord(columns=1:7, groupColumn=8)
g2 <- pair_auto %>%
  ggparcoord(columns=c(2:5, 7, 6, 1), groupColumn=8, alpha=0.5, splineFactor=10)
g1 + g2
# Save the combined plot to a PNG file and include it in the document
ggsave(filename="example.png", plot=g1 + g2,
       width=8, height=4, units="in", dpi=300)
knitr::include_graphics("example.png")
# Figure 3.15: autoplot and autolayer for time series
library(forecast)
autoplot(AirPassengers) +
  autolayer(seasadj(decompose(AirPassengers, "multiplicative"))) +
  theme(legend.position="none")