Week 8: Wrangling Data with dplyr

Enter the Tidyverse and learn essential data manipulation verbs.


Chapter 8: Introduction to Data Manipulation with `dplyr`

The Tidyverse Concept.

The Tidyverse is an opinionated collection of R packages designed for data science. All packages in the Tidyverse share an underlying design philosophy, grammar, and data structures, making them work seamlessly together.

Key aspects of the Tidyverse philosophy include:

Using "tidy data" principles (each variable a column, each observation a row, each type of observational unit a table).
Promoting readable code, often through the use of pipes.
Providing consistent function names and arguments.
Focusing on human-readable data manipulation and visualization.

Some core packages within the Tidyverse include:

`dplyr`: For data manipulation (filtering, selecting, mutating, arranging, summarising).
`ggplot2`: For declarative data visualization.
`tidyr`: For tidying data (reshaping layout).
`readr`: For reading rectangular data files (like CSVs).
`purrr`: For functional programming tools.
`tibble`: A modern reimagining of data frames.
`stringr`: For working with strings.
`forcats`: For working with factors.

We will start by focusing on `dplyr`, a cornerstone of the Tidyverse for data transformation.

Introduction to `dplyr` & Installation.

`dplyr` provides a consistent set of "verbs" (functions) that help you solve the most common data manipulation challenges. It's designed to be fast and intuitive.

Installing Packages

If you haven't used `dplyr` or the Tidyverse before, you need to install it first. Packages only need to be installed once per R installation (or library path).

You can install just `dplyr` or the entire Tidyverse suite (which includes `dplyr` and other useful packages). Installing the whole Tidyverse is often recommended.

# Install only dplyr
install.packages("dplyr")

# Install the complete Tidyverse (recommended)
install.packages("tidyverse")

Run one of these commands in your R console. You might be asked to choose a CRAN mirror (a server to download from) – pick one geographically close to you.
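If a script may be run more than once, it can be convenient to install a package only when it is missing. A minimal sketch using base R's `requireNamespace()`; whether to automate installation this way is a matter of preference:

```r
# Install dplyr only if it is not already available in the library path.
# requireNamespace() returns FALSE (instead of raising an error) when the package is missing.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Check which version is installed
packageVersion("dplyr")
```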

Loading Packages

Installing a package does not make its functions available automatically. Each time you start a new R session, load the package with the `library()` function before using it.

# Load dplyr (or tidyverse)
library(dplyr)

# Or, if you installed the whole suite:
library(tidyverse) # This loads dplyr and several other core packages

Now you're ready to use `dplyr` functions!
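When `dplyr` is attached, R prints a message saying that some functions are masked, for example `filter()` and `lag()` from the `stats` package. This is expected. The sketch below shows one way to be explicit about which package's function is meant, using the `::` operator:

```r
library(dplyr)

# After loading dplyr, a bare filter() refers to dplyr's version.
# The package::function form removes any ambiguity.
setosa <- dplyr::filter(iris, Species == "setosa")
nrow(setosa)  # 50

# The masked function is still reachable through its namespace.
# stats::filter() applies a linear filter to a numeric series
# (here a 3-point moving average).
smoothed <- stats::filter(1:10, rep(1 / 3, 3))
```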

Core `dplyr` Verbs.

`dplyr` revolves around a few key verbs (functions) that take a data frame as their first argument and return a new data frame, leaving the original unchanged. We'll use the built-in `iris` dataset for examples.

# Load dplyr if not already loaded
library(dplyr)

# View the structure of the iris dataset
str(iris)
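As a quick check of the "data frame in, data frame out" behaviour, the sketch below inspects `iris` with `glimpse()` (dplyr's compact alternative to `str()`) and confirms that applying a verb does not alter the original data:

```r
library(dplyr)

# glimpse() prints one line per column, with its type and first values
glimpse(iris)

# Verbs return a new data frame; the original is left untouched
setosa_only <- filter(iris, Species == "setosa")
dim(setosa_only)  # 50 rows, 5 columns
dim(iris)         # still 150 rows, 5 columns
```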

1. `filter()`: Subset rows based on conditions.

Selects rows that meet specified logical criteria.

# Filter for rows where Species is 'setosa'
setosa_data <- filter(iris, Species == "setosa")
head(setosa_data)

# Filter for rows where Sepal.Length is greater than 7
long_sepals <- filter(iris, Sepal.Length > 7)
head(long_sepals)

# Filter with multiple conditions (Species is virginica AND Petal.Width > 2)
virginica_wide <- filter(iris, Species == "virginica" & Petal.Width > 2)
head(virginica_wide)
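A few more ways to combine conditions in `filter()`, sketched on the same dataset. The cut-off values used here are arbitrary and chosen only for illustration:

```r
library(dplyr)

# Conditions separated by commas are combined with AND,
# so this is equivalent to joining them with &
virginica_wide <- filter(iris, Species == "virginica", Petal.Width > 2)

# Use | for OR: rows with very short or very long sepals
extreme_sepals <- filter(iris, Sepal.Length < 4.5 | Sepal.Length > 7.5)

# Use %in% to match any value from a set
two_species <- filter(iris, Species %in% c("setosa", "versicolor"))
nrow(two_species)  # 100 rows: 50 per species
```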

2. `select()`: Subset columns by name.

Keeps or removes columns.

# Select only the Sepal.Length and Species columns
sepal_species <- select(iris, Sepal.Length, Species)
head(sepal_species)

# Select all columns *except* Species
no_species <- select(iris, -Species)
head(no_species)

# Select columns starting with "Petal"
petal_cols <- select(iris, starts_with("Petal"))
head(petal_cols)
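`select()` supports several other ways of picking columns. The sketch below shows a column range, another selection helper, and renaming; the new column names are made up for the example:

```r
library(dplyr)

# Select a range of adjacent columns with the : operator
sepal_to_petal_length <- select(iris, Sepal.Length:Petal.Length)

# Select columns whose names end with "Width"
width_cols <- select(iris, ends_with("Width"))

# Rename columns while selecting them (new_name = old_name)
renamed <- select(iris, species = Species, sepal_length = Sepal.Length)
names(renamed)

# rename() changes names but keeps every column
iris_renamed <- rename(iris, species = Species)
```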

3. `arrange()`: Reorder rows based on column values.

Sorts the rows.

# Arrange rows by Petal.Length (ascending by default)
arranged_data <- arrange(iris, Petal.Length)
head(arranged_data)

# Arrange rows by Petal.Length in descending order
arranged_desc <- arrange(iris, desc(Petal.Length))
head(arranged_desc)

# Arrange by Species, then by Sepal.Width within each species
arranged_multi <- arrange(iris, Species, Sepal.Width)
head(arranged_multi)
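Ascending and descending order can be mixed in a single call by wrapping individual columns in `desc()`. A short sketch:

```r
library(dplyr)

# Sort by Species (ascending), then by Sepal.Width from
# largest to smallest within each species
arranged_mixed <- arrange(iris, Species, desc(Sepal.Width))
head(arranged_mixed)

# Species is a factor, so rows are ordered by its levels
levels(iris$Species)
```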

4. `mutate()`: Create new columns or modify existing ones.

Computes and adds new variables.

# Create a new column 'Petal.Area'
iris_with_area <- mutate(iris, Petal.Area = Petal.Length * Petal.Width)
head(iris_with_area)

# Create multiple new columns
iris_mod <- mutate(iris,
  Petal.Area = Petal.Length * Petal.Width,
  Sepal.Area = Sepal.Length * Sepal.Width
)
head(iris_mod)
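`mutate()` can also overwrite an existing column, and a column created earlier in a call can be used by later expressions in the same call. A minimal sketch; the `Large.Petal` column and its threshold of 5 are arbitrary choices for illustration:

```r
library(dplyr)

iris_extended <- mutate(iris,
  # Overwrite an existing column: convert Sepal.Length from cm to mm
  Sepal.Length = Sepal.Length * 10,
  # Create a new column
  Petal.Area = Petal.Length * Petal.Width,
  # A column created earlier in the same call can be used right away
  Large.Petal = Petal.Area > 5
)
head(iris_extended)
```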

These verbs form the foundation of data manipulation with `dplyr`. We'll cover `summarise()` and `group_by()` next week.

The Pipe Operator (`%>%` or `|>`).

Often, you want to perform multiple data manipulation steps sequentially. You could do this by saving intermediate results, but that can be verbose. Alternatively, you could nest function calls, but that quickly becomes hard to read:

# Nested approach (hard to read)
head(arrange(select(filter(iris, Species == "setosa"), Sepal.Length, Petal.Length), desc(Sepal.Length)))
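For comparison, the intermediate-results approach mentioned above might look like the sketch below. The variable names are invented for the example:

```r
library(dplyr)

# Step-by-step approach: each intermediate result gets its own name
setosa_rows <- filter(iris, Species == "setosa")
setosa_cols <- select(setosa_rows, Sepal.Length, Petal.Length)
setosa_sorted <- arrange(setosa_cols, desc(Sepal.Length))
head(setosa_sorted)
```

It is readable, but it leaves three extra objects in the workspace that are only needed once.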

The pipe operator provides a much more elegant solution. It takes the result of the expression on its left-hand side and "pipes" it into the first argument of the function on its right-hand side.

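The most common pipe is `%>%`, which comes from the `magrittr` package. `dplyr` re-exports it, so it is available as soon as you load `dplyr` or the `tidyverse`. R 4.1.0 introduced a native pipe, `|>`, which behaves the same way for the pipelines in this chapter. The main differences are that `|>` requires the right-hand side to be a function call with parentheses, and that it uses `_` (R 4.2.0 and later) rather than `.` as its placeholder.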

Syntax with `%>%`

data_frame %>%
  verb1(argument2, ...) %>%
  verb2(argument2, ...) %>%
  ...

This reads like "Take `data_frame`, THEN apply `verb1`, THEN apply `verb2`, ..."

Example using Pipe

Let's rewrite the nested example using pipes:

# Pipe approach (much more readable)
result <- iris %>%
  filter(Species == "setosa") %>% # Take iris, THEN filter
  select(Sepal.Length, Petal.Length) %>% # THEN select columns
  arrange(desc(Sepal.Length)) %>% # THEN arrange rows
  head() # THEN take the first few rows

print(result)

The pipe makes complex data manipulation sequences much easier to write, read, and debug.
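To compare the two pipes directly, the sketch below runs the same pipeline with `%>%` and with `|>`. It assumes R 4.1.0 or later, since the native pipe is not available in older versions:

```r
library(dplyr)

# Same pipeline written with each pipe
with_magrittr <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Petal.Length)

with_native <- iris |>
  filter(Species == "setosa") |>
  select(Sepal.Length, Petal.Length)

# Check that both pipelines give the same data frame
identical(with_magrittr, with_native)

# One difference: %>% accepts a bare function name,
# while |> needs the parentheses
iris %>% head
iris |> head()
```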