Week 4: Organizing Data with Lists & Data Frames

Learn about R's flexible lists and the essential data frame structure.


Chapter 4: Data Structures - Lists & Data Frames

Introduction to Lists: Flexible Containers.

While atomic vectors require all elements to be of the same type, R lists are more flexible. A list is an ordered collection of components (elements) where each component can be of a different data type. Lists can contain vectors, matrices, other lists, functions, or any R object.

Creating Lists

Lists are created using the `list()` function. You can also name the components of a list during creation.

# A list with components of different types
my_list <- list(10, "hello", TRUE, c(1,2,3))
print(my_list)

# A named list
student_info <- list(
  name = "Alice",
  age = 21L,
  major = "Statistics",
  courses = c("STAT101", "MATH200", "CS150")
)
print(student_info)
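To illustrate the earlier point that a list can hold any R object, here is a minimal sketch with a vector, a matrix, a nested list, and a function side by side. The list `mixed` and its component names are hypothetical and chosen only for this example.

```r
# Hypothetical example list; the component names are illustrative only
mixed <- list(
  numbers = c(1.5, 2.5, 3.5),          # numeric vector
  grid    = matrix(1:4, nrow = 2),     # 2 x 2 integer matrix
  nested  = list(a = "x", b = FALSE),  # a list inside a list
  average = mean                       # a function stored as a component
)

length(mixed)   # 4: one count per top-level component
names(mixed)    # "numbers" "grid" "nested" "average"

# str() gives a compact overview of every component and its type
str(mixed)
```

Note that `length()` counts only top-level components, so the nested list counts as one component regardless of how many elements it holds.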

Accessing List Elements

Accessing elements in lists requires different operators depending on whether you want to extract a sub-list or a specific component:

`[]` (Single Brackets): Extracts a sub-list. The result is always a list.

student_info[1]       # Output: A list containing the 'name' component: list(name = "Alice")
student_info[c(1,3)]  # Output: A list containing 'name' and 'major'

`[[]]` (Double Brackets): Extracts a single component from the list. The type of the result is the type of the extracted component. You can select the component by numeric index or by name.

student_info[[1]]        # Output: [1] "Alice" (Character vector)
student_info[["major"]]   # Output: [1] "Statistics" (Character vector)
student_info[[4]]        # Output: [1] "STAT101" "MATH200" "CS150" (Character vector)

`$` Operator: A convenient shortcut to extract a single component by its name.

student_info$name    # Output: [1] "Alice"
student_info$courses # Output: [1] "STAT101" "MATH200" "CS150"

Understanding this difference is crucial when working with lists: `[]` always returns a list, even when you select a single component, whereas `[[]]` and `$` return the component itself.
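The practical consequence of the three operators can be checked directly. The sketch below recreates `student_info` so that it runs on its own; the helper variable names `age_sublist` and `age_value` are introduced here only for illustration.

```r
# Recreate the list from the earlier example so this block is self-contained
student_info <- list(
  name = "Alice",
  age = 21L,
  major = "Statistics",
  courses = c("STAT101", "MATH200", "CS150")
)

age_sublist <- student_info["age"]     # single brackets: a list of length 1
age_value   <- student_info[["age"]]   # double brackets: the integer itself

class(age_sublist)   # "list"
class(age_value)     # "integer"

age_value + 1L       # 22: arithmetic works on the extracted value
# age_sublist + 1L   # would fail: a list is not a valid operand for +

# With the full component name, $ returns the same object as [[ ]]
identical(student_info$age, student_info[["age"]])   # TRUE

# Chain [[ ]] with [ ] to reach an element inside a component
student_info[["courses"]][2]   # "MATH200"
```

A common symptom of using `[]` where `[[]]` was intended is an error about a non-numeric argument when passing the result to an arithmetic operation.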

Introduction to Data Frames.

Data frames are the most important data structure for storing and working with tabular data (like spreadsheets or database tables) in R. They are two-dimensional structures where columns can contain different data types, but all elements within a single column must be of the same type.

Technically, a data frame is a special kind of list where each component is an atomic vector (or factor, matrix, etc.) of the same length. Each component represents a column, and the elements within each component represent the rows.
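The list nature of a data frame can be verified with the list tools covered earlier. This is a small sketch using `data.frame()`, which is introduced in the next subsection; the data frame `scores` and its columns are hypothetical.

```r
# Hypothetical two-column data frame used only for this illustration
scores <- data.frame(id = 1:3, score = c(90.5, 82.0, 77.5))

is.list(scores)     # TRUE: a data frame is built on top of a list
length(scores)      # 2: one list component per column
names(scores)       # "id" "score"

# The list extraction operators return a whole column as a vector
scores[["score"]]   # 90.5 82.0 77.5
scores$id           # 1 2 3

# Single brackets with one index return a sub-list, here a one-column data frame
class(scores["score"])   # "data.frame"
```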

Creating Data Frames

You typically create data frames using the `data.frame()` function, providing named vectors as arguments, where each argument becomes a column.

# Create vectors for columns
ids <- 1:3
person_names <- c("Alice", "Bob", "Charlie")  # avoids reusing the name of base R's names()
ages <- c(25, 30, 22)
is_student <- c(TRUE, FALSE, TRUE)

# Create the data frame
my_dataframe <- data.frame(
  ID = ids,
  Name = person_names,
  Age = ages,
  Student = is_student
)

print(my_dataframe)
# Output:
#   ID    Name Age Student
# 1  1   Alice  25    TRUE
# 2  2     Bob  30   FALSE
# 3  3 Charlie  22    TRUE

Data frames are central to data analysis in R, and many packages (like `dplyr` and `ggplot2`) are designed to work primarily with them.

Viewing and Inspecting Data Frames.

When working with data frames, especially larger ones, it's essential to have tools to inspect their structure and content without printing the entire dataset.

`head(df, n=6)`: Shows the first `n` rows (default is 6).
`tail(df, n=6)`: Shows the last `n` rows (default is 6).
`str(df)`: Compactly displays the structure: the number of observations (rows) and variables (columns), the data type of each column, and its first few values. Highly recommended!
`summary(df)`: Provides a per-column summary: minimum, quartiles, median, mean, and maximum for numeric columns; counts per level for factors; counts of `TRUE` and `FALSE` for logical columns. Character columns only report their length, class, and mode.
`dim(df)`: Returns the dimensions (number of rows and columns) as a vector `c(rows, cols)`.
`nrow(df)`: Returns the number of rows.
`ncol(df)`: Returns the number of columns.
`names(df)` or `colnames(df)`: Returns the column names.
`rownames(df)`: Returns the row names (often just sequence numbers by default).

# Assuming 'my_dataframe' from the previous section
head(my_dataframe)
str(my_dataframe)
summary(my_dataframe)
dim(my_dataframe)
names(my_dataframe)
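These functions return ordinary R values, so their results can be stored and reused rather than only read off the console. The sketch below uses a hypothetical `inventory` data frame so that it runs on its own.

```r
# Hypothetical data frame used only for this illustration
inventory <- data.frame(
  item  = c("bolt", "nut", "washer", "screw"),
  count = c(120L, 85L, 300L, 42L)
)

dim(inventory)           # 4 2: rows first, then columns
nrow(inventory)          # 4
ncol(inventory)          # 2
rownames(inventory)      # "1" "2" "3" "4": default row names are character strings

head(inventory, n = 2)   # first two rows only
tail(inventory, n = 1)   # last row only

# Store a result for later use
n_items <- nrow(inventory)
```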

Importing Data from CSV Files.

Often, your data will reside in external files. Comma-separated values (CSV) files are a very common format for storing tabular data. R provides functions to read these files directly into data frames.

The `read.csv()` Function

The primary function for reading CSV files is `read.csv()`.

my_data <- read.csv(file = "path/to/your/file.csv")

Key arguments for `read.csv()`:

`file`: The path to the CSV file (a string). This can be a local path or even a URL.
`header`: A logical value (`TRUE` or `FALSE`) indicating whether the first row of the file contains variable (column) names. Default is `TRUE`.
`sep`: The character used to separate values within a row. Default is a comma (`","`). For tab-separated files, use `"\t"`.
`stringsAsFactors`: Historically, this defaulted to `TRUE`, converting character columns into factors. In recent R versions (4.0.0+), the default is `FALSE`, which is often preferred. It's good practice to set it explicitly (`stringsAsFactors = FALSE`) if you want character data to remain as character vectors.
`na.strings`: A character vector of strings that should be interpreted as missing values (`NA`).
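The arguments above can be combined in a single call. The following sketch writes a small tab-separated file to a temporary location and reads it back; the file contents and the placeholder string `"missing"` are assumptions made up for this example.

```r
# Write a small tab-separated file to a temporary location (hypothetical data)
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Name\tScore\tGroup",
  "David\t85\tA",
  "Eve\tmissing\tB",
  "Frank\t78\tA"
), path)

results <- read.csv(
  file = path,
  header = TRUE,                     # first line holds the column names
  sep = "\t",                        # values are separated by tabs
  stringsAsFactors = FALSE,          # keep text columns as character vectors
  na.strings = c("NA", "missing")    # treat both strings as missing values
)

str(results)
is.na(results$Score)   # FALSE TRUE FALSE

unlink(path)           # remove the temporary file
```

Without `"missing"` in `na.strings`, the `Score` column would be read as character because one of its entries is not a number.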

Example

Imagine you have a CSV file named `data.csv` in your current working directory (check it with `getwd()`; relative paths are resolved against it, not against the script's location) with the following content:

Name,Score,Group
David,85,A
Eve,92,B
Frank,78,A

You would read it like this:

dataset <- read.csv("data.csv", stringsAsFactors = FALSE)
print(dataset)
str(dataset)

There are other functions like `read.table()` (more general) and functions from packages like `readr` (part of the Tidyverse, often faster and more consistent) for reading various types of delimited files.
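As a rough comparison, the sketch below reads the same example data with `read.csv()` and with `read.table()`, where the header and separator must be given explicitly. The temporary file stands in for `data.csv`, and the `readr` part is optional and runs only if that package happens to be installed.

```r
# Recreate the example file in a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Score,Group", "David,85,A", "Eve,92,B", "Frank,78,A"), path)

via_csv   <- read.csv(path)
via_table <- read.table(path, header = TRUE, sep = ",")

all.equal(via_csv, via_table)   # TRUE for this simple file

# Optional: readr's reader, used only if the package is available
if (requireNamespace("readr", quietly = TRUE)) {
  via_readr <- readr::read_csv(path)
  print(via_readr)   # a tibble, readr's variant of a data frame
}

unlink(path)   # remove the temporary file
```

For files with quoted fields or comment characters the two base functions can behave differently, because `read.csv()` uses different defaults for those settings than `read.table()`.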