Week 9: Grouping, Summarising, and Tidying Data

Master grouped operations, reshape data layouts, and combine datasets.

Explore Chapter 9

Chapter 9: Advanced Data Manipulation & Tidying

`group_by()` and `summarise()`: Grouped Operations.

A very common data analysis task is to calculate summary statistics for different groups within your data. `dplyr` makes this easy using `group_by()` in combination with `summarise()`.

`group_by()`

The `group_by()` function doesn't change the data frame visually, but it adds grouping metadata. Subsequent `dplyr` verbs (most notably `summarise()`) will then operate independently on each group.

`summarise()` (or `summarize()`)

The `summarise()` function collapses a data frame into a single row, usually containing summary statistics. When used after `group_by()`, it collapses each group into a single summary row.

Common summary functions to use inside `summarise()` include:

  • `mean(column)`: Average value
  • `median(column)`: Median value
  • `sd(column)`: Standard deviation
  • `min(column)`: Minimum value
  • `max(column)`: Maximum value
  • `n()`: Counts the number of rows in the current group (no arguments needed).
  • `sum(column)`: Sum of values

Example

library(dplyr)

# Calculate average Sepal.Length for each Species in the iris dataset
species_avg_sepal <- iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length))

print(species_avg_sepal)
# Output:
# # A tibble: 3 × 2
#   Species    Avg.Sepal.Length
#   <fct>                 <dbl>
# 1 setosa                 5.01
# 2 versicolor             5.94
# 3 virginica              6.59

# Calculate multiple summaries: count and average Petal.Width per Species
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Count = n(),
    Avg.Petal.Width = mean(Petal.Width),
    Max.Petal.Length = max(Petal.Length)
  )

print(species_summary)

`ungroup()`

After performing grouped operations, the data frame might retain its grouping structure. If you want to remove the grouping for subsequent operations, use `ungroup()`.

grouped_data <- iris %>% group_by(Species)
# ... some grouped operations ...
ungrouped_data <- grouped_data %>% ungroup()

Introduction to `tidyr` for Tidy Data.

Data often comes in formats that are convenient for data entry or presentation but not ideal for analysis. "Tidy data" is a concept where:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The `tidyr` package (part of the Tidyverse) provides tools to reshape your data into a tidy format, making it easier to work with in `dplyr`, `ggplot2`, and other Tidyverse packages.

First, ensure `tidyr` is installed and loaded (it's included if you installed the `tidyverse` package):

# install.packages("tidyr") # If needed
library(tidyr)
# Or library(tidyverse)

Two fundamental functions in `tidyr` for reshaping data are `pivot_longer()` and `pivot_wider()`.

Reshaping Data: `pivot_longer()` and `pivot_wider()`.

Pivoting changes the layout of your data frame between "wide" and "long" formats.

`pivot_longer()`: From Wide to Long

Use `pivot_longer()` when some of your column names are actually values of a variable. It "lengthens" data by increasing the number of rows and decreasing the number of columns.

Key arguments:

  • `cols`: The columns to pivot into longer format (gather).
  • `names_to`: Name of the new column that will contain the original column names.
  • `values_to`: Name of the new column that will contain the values from the original pivoted columns.
# Example wide data
wide_data <- data.frame(
  student = c("Alice", "Bob"),
  test1_score = c(85, 90),
  test2_score = c(88, 85)
)
print(wide_data)

# Pivot test scores into longer format
long_data <- wide_data %>%
  pivot_longer(
    cols = c(test1_score, test2_score), # Or cols = ends_with("_score")
    names_to = "test_type",
    values_to = "score"
  )
print(long_data)
# Output:
# # A tibble: 4 × 3
#   student test_type   score
#   <chr>   <chr>       <dbl>
# 1 Alice   test1_score    85
# 2 Alice   test2_score    88
# 3 Bob     test1_score    90
# 4 Bob     test2_score    85

`pivot_wider()`: From Long to Wide

Use `pivot_wider()` when an observation is scattered across multiple rows. It "widens" data by increasing columns and decreasing rows.

Key arguments:

  • `names_from`: The column whose values will become the new column names.
  • `values_from`: The column whose values will fill the new columns.
# Using long_data from above
wide_again <- long_data %>%
  pivot_wider(
    names_from = test_type,
    values_from = score
  )
print(wide_again)
# Output (similar to original wide_data):
# # A tibble: 2 × 3
#   student test1_score test2_score
#   <chr>         <dbl>       <dbl>
# 1 Alice            85          88
# 2 Bob              90          85

Pivoting is crucial for preparing data for specific types of analysis or visualization.

Combining Datasets: Basic Joins.

Often, data needed for an analysis is spread across multiple tables or data frames. You need to combine them based on common "key" variables. `dplyr` provides intuitive functions for joining datasets.

The main join functions are:

  • `inner_join(x, y, by = "key")`: Keeps only rows where the key exists in both `x` and `y`.
  • `left_join(x, y, by = "key")`: Keeps all rows from `x` and adds matching columns from `y`. If no match in `y`, the new columns get `NA`.
  • `right_join(x, y, by = "key")`: Keeps all rows from `y` and adds matching columns from `x`. If no match in `x`, the new columns get `NA`.
  • `full_join(x, y, by = "key")`: Keeps all rows from both `x` and `y`. Where matches don't exist, `NA`s are filled in.

The `by` argument specifies the key column(s) to join on. If the key columns have the same name in both data frames, you can just provide the name (e.g., `by = "student_id"`). If they have different names, use a named vector (e.g., `by = c("student_id" = "pupil_identifier")`). If omitted, `dplyr` will try to join by all common column names.

Example

library(dplyr)

# Sample data frames
students <- data.frame(
  StudentID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)

enrollments <- data.frame(
  StudentID = c(1, 1, 2, 4), # Note: Student 4 is not in 'students'
  Course = c("Math", "Science", "Math", "Art")
)

# Inner Join: Keep only students present in both tables
inner_join_data <- inner_join(students, enrollments, by = "StudentID")
print(inner_join_data)
#   StudentID  Name  Course
# 1         1 Alice    Math
# 2         1 Alice Science
# 3         2   Bob    Math

# Left Join: Keep all students, add their enrollments (Charlie has NA for Course)
left_join_data <- left_join(students, enrollments, by = "StudentID")
print(left_join_data)
#   StudentID    Name  Course
# 1         1   Alice    Math
# 2         1   Alice Science
# 3         2     Bob    Math
# 4         3 Charlie    <NA>

Choosing the correct join type is crucial for combining datasets accurately.

Syllabus