Week 10: Visualizing Data with ggplot2

Unlock the power of the Grammar of Graphics to create insightful plots.

Chapter 10: Introduction to Data Visualization with `ggplot2`

The Grammar of Graphics.

`ggplot2` is based on the "Grammar of Graphics", a concept developed by Leland Wilkinson. This grammar provides a framework for thinking about and building plots layer by layer. Instead of thinking in terms of specific chart types (such as scatter plots or bar charts), you think about the components that make up a graphic:

  • Data: The dataset containing the variables you want to plot.
  • Aesthetics (`aes`): How variables in your data are mapped to visual properties (aesthetics) of the plot. Examples include mapping variables to x-position, y-position, color, size, shape, fill, etc.
  • Geometric Objects (`geoms`): The visual elements used to represent the data points (e.g., points, lines, bars, boxes).
  • Facets: Creating subplots based on subsets of the data.
  • Statistics (`stats`): Statistical transformations applied to the data (e.g., calculating counts for a bar chart, smoothing lines).
  • Coordinates (`coords`): The coordinate system used (e.g., Cartesian, polar).
  • Theme: Controls the non-data elements of the plot (e.g., background, grid lines, fonts, titles).

By combining these components, you can create a vast array of customized and informative visualizations.
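To make the component list concrete, here is a minimal sketch that touches each component once. It assumes the `mpg` dataset that ships with `ggplot2` (introduced later in this chapter); the particular geom, stat, facet variable, axis limits, and theme are illustrative choices, not the only way to combine them.

```r
library(ggplot2)

# One plot that uses every component from the list above.
p_components <- ggplot(data = mpg,                          # Data
                       mapping = aes(x = displ, y = hwy)) + # Aesthetics
  geom_point(alpha = 0.6) +                                 # Geom: one point per car
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # Stat: fitted linear trend
  facet_wrap(~ drv) +                                       # Facets: one panel per drive type
  coord_cartesian(ylim = c(10, 45)) +                       # Coordinates: Cartesian, fixed y-range
  theme_minimal()                                           # Theme: non-data styling

p_components
```

Removing or swapping any single line changes one aspect of the graphic while leaving the rest intact, which is the practical benefit of thinking in components.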

Introduction to `ggplot2`.

`ggplot2` is the data visualization package of the Tidyverse and one of the most widely used plotting packages in R. It implements the Grammar of Graphics, allowing you to build complex plots iteratively, one layer at a time.

Installation and Loading

Like other packages, `ggplot2` needs to be installed once and loaded in each R session where you want to use it. It's included in the `tidyverse` package.

# Install ggplot2 (if not already installed via tidyverse)
# install.packages("ggplot2")

# Install the complete Tidyverse (includes ggplot2)
# install.packages("tidyverse")

# Load the package
library(ggplot2)
# Or, if you installed the whole suite:
# library(tidyverse)
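
If you prefer a script that works whether or not the package is already present, one possible pattern is to install only when needed. This is a sketch, assuming a CRAN mirror is configured for your R session:

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Show which version is loaded
packageVersion("ggplot2")
```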

Basic Plotting Template

A typical `ggplot2` call follows this structure:

ggplot(data = <DATA_FRAME>, mapping = aes(<AESTHETIC_MAPPINGS>)) +
  <GEOM_FUNCTION>()

  • `ggplot()`: Initializes the plot, specifying the default dataset and optionally the default aesthetic mappings.
  • `aes()`: Defines the aesthetic mappings – how variables map to visual properties.
  • `+`: Layers are added to the plot using the plus sign.
  • `<GEOM_FUNCTION>()`: Specifies the type of geometric object to draw (e.g., `geom_point()`, `geom_bar()`).

We'll use the built-in `mpg` dataset (available when `ggplot2` is loaded) for examples. It contains fuel economy data for cars.

# View structure of mpg dataset
str(mpg)
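
Besides `str()`, a few other base R functions give a quick overview of the dataset before plotting. The sketch below only inspects `mpg`; it does not change it.

```r
library(ggplot2)

dim(mpg)      # number of rows and columns
names(mpg)    # column names, including displ, hwy, class, drv and cyl
head(mpg, 5)  # first five rows

# ?mpg        # in an interactive session, opens the help page describing each column
```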

`ggplot()` and `aes()`: Data and Aesthetics.

`ggplot()`

The `ggplot()` function creates the initial plot object. Its most important argument is `data`, which specifies the data frame containing the variables you want to plot.

# Initialize a plot using the mpg dataset
ggplot(data = mpg) # Creates an empty grey background (no layers yet)

`aes()` - Aesthetic Mappings

The `aes()` function defines how variables from your data frame are mapped to visual properties (aesthetics) of the plot layers (geoms). Common aesthetics include:

  • `x`: Position on the x-axis.
  • `y`: Position on the y-axis.
  • `color` (or `colour`): Color of points, lines, outlines.
  • `fill`: Fill color of shapes like bars, boxes.
  • `size`: Size of points or thickness of lines (from `ggplot2` 3.4.0 onward, line thickness has its own aesthetic, `linewidth`).
  • `shape`: Shape of points.
  • `alpha`: Transparency.

You specify mappings within `aes()` like `aes(x = variable1, y = variable2, color = variable3)`.

# Initialize plot, mapping 'displ' (engine displacement) to x-axis
# and 'hwy' (highway miles per gallon) to y-axis
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
# No data is drawn yet, but the axes are now set up for displ and hwy

Mappings defined in the main `ggplot()` call are inherited by subsequent geom layers unless overridden.
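
A small sketch of this inheritance, assuming the `mpg` data: `x` and `y` are defined once in `ggplot()`, while `color` is added for one layer only. It uses `geom_smooth()`, which is covered later in this chapter, simply as a second layer.

```r
library(ggplot2)

# x and y are set once in ggplot() and inherited by both layers.
# color = class is mapped only inside geom_point(), so the points are
# colored by class while geom_smooth() fits a single overall line.
p_inherit <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(se = FALSE)

p_inherit
```

If `color = class` were moved into the `ggplot()` call instead, `geom_smooth()` would inherit it as well and draw one line per class.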

Geometric Objects (Geoms): Representing Data Visually.

Geoms determine how the data is actually represented on the plot. You add geoms as layers using the `+` operator. Each `geom_*()` function has required aesthetics (for example, `geom_point()` needs `x` and `y`), which can be supplied either in `ggplot()` or within the geom itself.
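
The sketch below shows both places a mapping can live, and contrasts mapping an aesthetic to a variable (inside `aes()`) with setting it to a fixed value (outside `aes()`). The color name and point size are arbitrary illustrative choices.

```r
library(ggplot2)

# Mappings supplied within the geom itself: color varies with drv
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# Mappings supplied in ggplot(); color and size are SET to constants,
# so they go outside aes() and apply to every point
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 2)

p_mapped
p_fixed
```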

`geom_point()`: Scatter Plots

Used to create scatter plots, showing the relationship between two continuous variables.

# Scatter plot of engine displacement vs highway mpg
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Map 'class' variable to point color
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Map 'cyl' (cylinders) to point size (use sparingly)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, size = cyl)) +
  geom_point(alpha = 0.5) # Add transparency

`geom_smooth()`: Adding Trend Lines

Often used with `geom_point()` to show trends in the data.

# Scatter plot with a smoothed conditional mean line
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() # Method is chosen automatically (loess for fewer than 1,000 observations)
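
If a straight-line trend is more appropriate than a flexible smoother, the method can be stated explicitly. A minimal sketch, assuming a simple linear fit of `hwy` on `displ` is what you want to show:

```r
library(ggplot2)

# Scatter plot with a linear regression line; se = FALSE hides the
# confidence band that geom_smooth() draws by default
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```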

`geom_bar()` / `geom_col()`: Bar Charts

`geom_bar()` automatically counts occurrences of each category on the x-axis (`stat = "count"` is the default). `geom_col()` is used when your data already contains a variable giving each bar's height; it plots those values as they are.

# Bar chart of car counts by 'class'
ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# Bar chart with 'drv' (drive type) mapped to fill color
ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar(position = "dodge") # Dodge places bars side-by-side
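
For `geom_col()`, the bar heights must already be in the data. The sketch below summarizes `mpg` first with base R's `aggregate()`; the summary table name `hwy_by_class` is just an illustrative choice.

```r
library(ggplot2)

# Pre-compute one row per class: mean highway mpg
hwy_by_class <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# geom_col() uses the hwy values directly as bar heights
p_col <- ggplot(data = hwy_by_class, mapping = aes(x = class, y = hwy)) +
  geom_col()

p_col
```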

Other Common Geoms

  • `geom_line()`: For line graphs (connects observations in order of the x variable).
  • `geom_boxplot()`: For box-and-whisker plots summarizing distributions.
  • `geom_histogram()`: For visualizing the distribution of a single continuous variable.
  • `geom_density()`: For smoothed density estimates.
  • `geom_text()` / `geom_label()`: For adding text annotations.
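
A few of these geoms applied to `mpg`, as a rough sketch. The bin width and transparency values are arbitrary starting points that you would normally tune for your own data.

```r
library(ggplot2)

# Box plot: distribution of highway mpg within each class
p_box <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# Histogram: distribution of a single continuous variable
p_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Density: smoothed distribution, one semi-transparent curve per drive type
p_dens <- ggplot(data = mpg, mapping = aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4)

p_box
p_hist
p_dens
```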

Experimenting with different geoms and aesthetic mappings is key to effective data visualization with `ggplot2`!
