Week 11: Polishing Plots and Basic Statistics

Learn to customize ggplot2 visualizations, save them, and perform basic statistical analysis in R.


Chapter 11: Refining Visualizations & Exploring Data Summaries

Customizing `ggplot2` Plots.

While `ggplot2` produces sensible defaults, you'll often want to customize plots for clarity, publication, or specific emphasis. You add customization layers using the `+` operator.

Labels and Titles

The `labs()` function is the primary way to add titles, subtitles, captions, and change axis/legend labels.

library(ggplot2)

# Base plot from last week
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Add labels
p +
  labs(
    title = "Fuel Efficiency vs. Engine Displacement",
    subtitle = "Data from ggplot2 mpg dataset",
    caption = "Source: fueleconomy.gov",
    x = "Engine Displacement (Liters)",
    y = "Highway Miles Per Gallon",
    color = "Vehicle Class" # Changes the legend title
  )

You can also use `ggtitle()`, `xlab()`, and `ylab()` for individual labels.
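As a small sketch of that alternative, the same base plot can be labeled one helper at a time. The object name `p_labeled` is made up for this illustration.

```r
library(ggplot2)

# Same base plot as above
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# One helper per label instead of a single labs() call
p_labeled <- p +
  ggtitle("Fuel Efficiency vs. Engine Displacement") +
  xlab("Engine Displacement (Liters)") +
  ylab("Highway Miles Per Gallon")

print(p_labeled)
```

`labs()` is usually more compact when you are setting several labels at once; the individual helpers are handy when you only need to change one.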

Themes

`ggplot2` includes several built-in themes that control the overall look (background, gridlines, fonts). Apply them by adding a `theme_*()` function.

p + theme_bw() # Black and white theme
p + theme_minimal() # Minimal theme
p + theme_classic() # Classic theme without gridlines
# p + theme_void() # Empty theme (axes/grid removed)

For fine-grained adjustments, use the `theme()` function to target specific elements such as `plot.title`, `axis.text`, or `legend.position`. This is a more advanced topic.
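To give a first taste, here is a minimal sketch that combines a built-in theme with a few `theme()` tweaks. The particular sizes and the bottom legend position are arbitrary choices for illustration, and `p_themed` is a made-up name.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency vs. Engine Displacement")

p_themed <- p +
  theme_minimal() +                                       # start from a complete theme
  theme(
    plot.title = element_text(face = "bold", size = 14),  # bold, larger title
    axis.text = element_text(size = 10),                  # tick label size
    legend.position = "bottom"                            # move legend below the plot
  )

print(p_themed)
```

Order matters here: a complete theme such as `theme_minimal()` replaces all theme settings, so add your `theme()` tweaks after it.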

Scales

Scales control how data values are mapped to aesthetics. You can use `scale_*()` functions to customize axes, colors, shapes, etc.

# Example: Manually setting colors for vehicle class
p +
  scale_color_manual(values = c(
    "2seater" = "red", "compact" = "blue", "midsize" = "green",
    "minivan" = "purple", "pickup" = "orange", "subcompact" = "brown",
    "suv" = "black")
  )

# Example: Adjusting x-axis limits and breaks
p +
  scale_x_continuous(limits = c(1, 7), breaks = 1:7)

There are many `scale_*()` functions (`scale_fill_brewer`, `scale_shape_manual`, `scale_y_log10`, etc.) for detailed control.
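As a brief sketch of two of these, the plot below puts the y-axis on a log scale and swaps the default colors for a ColorBrewer palette. The `"Dark2"` palette is just one option that happens to have enough colors for the seven vehicle classes; `p_scaled` is a made-up name.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_scaled <- p +
  scale_y_log10() +                        # log10-transform the y-axis
  scale_color_brewer(palette = "Dark2")    # qualitative ColorBrewer palette

print(p_scaled)
```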

Saving Your Plots (`ggsave`).

Once you've created a plot you like, you'll want to save it to a file (e.g., for reports or presentations). The `ggsave()` function is the recommended way to save plots created with `ggplot2`.

Usage

By default, `ggsave()` saves the last plot displayed. You can also pass a plot object explicitly (like the `p` we created earlier).

# First, create the plot and maybe display it
my_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(title = "Counts of Vehicle Classes")
print(my_plot)

# Save the last plot displayed (my_plot) as a PNG file
ggsave(filename = "plots/class_counts.png") # Assumes a 'plots' subdirectory exists

# Save a specific plot object (p) as a PDF, specifying dimensions
ggsave(filename = "plots/displacement_vs_hwy.pdf", plot = p, width = 8, height = 6, units = "in")

# Save with higher resolution (dots per inch) for PNG
ggsave(filename = "plots/high_res_plot.png", plot = my_plot, dpi = 300)

Key arguments:

  • `filename`: The path and name of the output file (the file type is inferred from the extension, e.g., `.png`, `.pdf`, `.jpg`, `.svg`; saving `.svg` requires the svglite package).
  • `plot`: The plot object to save (defaults to the last plot displayed).
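  • `width`, `height`: Dimensions of the saved plot (if omitted, the size of the current graphics device is used).
  • `units`: Units for width and height (`"in"`, `"cm"`, `"mm"`; recent ggplot2 versions also accept `"px"`).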
  • `dpi`: Resolution (dots per inch) for raster formats like PNG/JPG.
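Putting these arguments together, here is a minimal sketch that creates the output folder first and then saves a plot into it. The folder location (a `plots` directory inside the session's temporary directory) and the chosen dimensions are assumptions for illustration; in a real project you would point `out_dir` at your own folder.

```r
library(ggplot2)

my_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(title = "Counts of Vehicle Classes")

# Hypothetical output location: a 'plots' folder in the temporary directory
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_file <- file.path(out_dir, "class_counts.png")

# Pass the plot explicitly so the result does not depend on what was displayed last
ggsave(filename = out_file, plot = my_plot,
       width = 6, height = 4, units = "in", dpi = 150)

file.exists(out_file)  # check that the file was written
```

Passing `plot =` explicitly is a good habit in scripts, where "the last plot displayed" may not be the one you expect.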

Basic Statistical Functions in R.

Beyond visualization, R is fundamentally a statistical programming language. Base R includes many functions for calculating common descriptive statistics.

These functions typically operate on numeric vectors. Missing values (`NA`) often require special handling (e.g., using the `na.rm = TRUE` argument).

data_vector <- c(1, 5, 2, 8, 3, NA, 7)

# Mean (Average)
mean(data_vector) # Output: NA (because of the NA value)
mean(data_vector, na.rm = TRUE) # Output: [1] 4.333333

# Median (Middle value)
median(data_vector, na.rm = TRUE) # Output: [1] 4

# Standard Deviation
sd(data_vector, na.rm = TRUE) # Output: [1] 2.804758

# Variance
var(data_vector, na.rm = TRUE) # Output: [1] 7.866667

# Minimum and Maximum
min(data_vector, na.rm = TRUE) # Output: [1] 1
max(data_vector, na.rm = TRUE) # Output: [1] 8

# Range (Minimum and Maximum)
range(data_vector, na.rm = TRUE) # Output: [1] 1 8

# Quantiles (e.g., 0%, 25%, 50%, 75%, 100%)
quantile(data_vector, na.rm = TRUE)
# Output:
#   0%  25%  50%  75% 100%
# 1.00 2.25 4.00 6.50 8.00

These functions are building blocks for exploring and understanding your data.
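As one illustration of combining them, the sketch below asks for custom quantiles, computes the interquartile range, and then calculates a mean per group on the built-in `iris` data. The variable names are made up for this example.

```r
x <- c(1, 5, 2, 8, 3, NA, 7)

# Custom quantiles: the 10th and 90th percentiles
q <- quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
q

# Interquartile range (75th percentile minus 25th percentile)
iqr_x <- IQR(x, na.rm = TRUE)
iqr_x

# Apply mean() to Sepal.Length separately for each species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
```

`tapply()` is only one way to summarize by group; it is used here because it needs nothing beyond base R.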

The `summary()` Function.

The `summary()` function is a generic function in R, meaning its behavior depends on the type of object passed to it. It provides a quick overview of different kinds of R objects.
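A quick sketch of that behavior: the same `summary()` call returns differently shaped results for a numeric, a logical, and a factor vector. The three small vectors are invented for illustration.

```r
x_num <- c(1.5, 2.5, 4)
x_lgl <- c(TRUE, FALSE, TRUE)
x_fct <- factor(c("a", "b", "a"))

# The class of each object determines which summary method runs
class(x_num)
class(x_lgl)
class(x_fct)

s_num <- summary(x_num)  # quartiles and mean
s_lgl <- summary(x_lgl)  # mode plus counts of FALSE and TRUE
s_fct <- summary(x_fct)  # count per factor level

s_num
s_lgl
s_fct
```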

Summary of a Numeric Vector

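For a numeric vector, `summary()` reports six values: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum (the five-number summary plus the mean). If the vector contains missing values, it also reports how many there are in an `NA's` column.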

data_vector <- c(1, 5, 2, 8, 3, NA, 7)
summary(data_vector)
# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#   1.000   2.250   4.000   4.333   6.500   8.000       1

Summary of a Factor or Character Vector

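For a factor, `summary()` returns the frequency count of each level. For a plain character vector it reports only the length, class, and mode, so convert the vector with `factor()` or use `table()` when you want counts.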

classes <- factor(c("A", "B", "A", "C", "B", "A"))
summary(classes)
# Output:
# A B C
# 3 2 1
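The difference between the two types is easy to check. In this sketch, `letters_chr` is a made-up character vector holding the same values as `classes` above.

```r
letters_chr <- c("A", "B", "A", "C", "B", "A")

# Plain character vector: only length, class, and mode are reported
chr_summary <- summary(letters_chr)
chr_summary

# table() counts values directly, without converting to a factor
chr_counts <- table(letters_chr)
chr_counts

# Converting to a factor makes summary() return counts per level
fct_summary <- summary(factor(letters_chr))
fct_summary
```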

Summary of a Data Frame

For a data frame, `summary()` applies the appropriate summary function to each column based on its data type.

summary(iris)
# Output (abbreviated):
# Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
# Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
# 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
# Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
# Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
# 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
# Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

The `summary()` function is an excellent first step in exploring any new dataset or object in R.
