Grouping & Summarizing Data in R

Often in data analysis, we need to calculate summary statistics for different groups within our dataset. For example, we might want to find the average weight of animals based on their sex, or the total count of observations per year. The dplyr package, part of the Tidyverse, provides powerful and intuitive tools for these tasks, primarily through the group_by() and summarise() (or summarize()) functions. This article will guide you through various ways to group and summarize your data effectively.

Setup & Initial Data Inspection

First, let’s load the necessary libraries and our example dataset. We’ll be using the knz_bison dataset from the lterdatasampler package, which contains data on bison from the Konza Prairie LTER site. We’ll also create an approx_age column by subtracting the animal’s year of birth (animal_yob) from the recording year (rec_year).

# --- 0. Setup & Initial Data Inspection --- #
library(lterdatasampler)
library(tidyverse)

bison_data <- lterdatasampler::knz_bison %>%
  mutate(approx_age = rec_year - animal_yob)

bison_data
# A tibble: 8,325 × 9
   data_code rec_year rec_month rec_day animal_code animal_sex animal_weight animal_yob approx_age
   <chr>        <dbl>     <dbl>   <dbl> <chr>       <chr>              <dbl>      <dbl>      <dbl>
 1 CBH01         1994        11       8 813         F                    890       1981         13
 2 CBH01         1994        11       8 834         F                   1074       1983         11
 3 CBH01         1994        11       8 B-301       F                   1060       1983         11
 4 CBH01         1994        11       8 B-402       F                    989       1984         10
 5 CBH01         1994        11       8 B-403       F                   1062       1984         10
 6 CBH01         1994        11       8 B-502       F                    978       1985          9
 7 CBH01         1994        11       8 B-503       F                   1068       1985          9
 8 CBH01         1994        11       8 B-504       F                   1024       1985          9
 9 CBH01         1994        11       8 B-601       F                    978       1986          8
10 CBH01         1994        11       8 B-602       F                   1188       1986          8
# ℹ 8,315 more rows
# ℹ Use `print(n = ...)` to see more rows

Plains Bison at Konza Prairie

A herd of bison grazing on the prairie.

The data used in these examples come from long-term studies of Plains bison (Bison bison bison) at the Konza Prairie Biological Station, an NSF Long-Term Ecological Research (LTER) site in northeastern Kansas. Understanding population dynamics, such as sex ratios and age structures across years, is crucial for managing these iconic prairie animals and their grassland ecosystem.

Basic Grouping & Summarizing

The core idea is to first group the data by one or more variables using group_by(), and then apply summary functions using summarise(). The summarise() function creates a new data frame containing one row for each group, with new columns representing the summary statistics.

Group by One Variable

When working within summarise(), you have access to several helpful context-dependent functions. A commonly used one is n(), which returns the number of rows (observations) in the current group. Another useful one is n_distinct(variable_name), which counts the number of unique values of a specified variable within the current group. In this example, we will use n() to count observations.

Let’s group the bison data by animal_sex and calculate the mean weight, standard deviation of weight, and the number of observations for each sex using the aforementioned n() helper function. Note also the use of na.rm = TRUE within mean() and sd() to ensure that missing values (NA) in the animal_weight column are ignored during calculations.

# Example: Group by one variable ('animal_sex') and calculate mean weight,
# standard deviation of weight, and count of observations.
bison_data %>%
  group_by(animal_sex) %>%
  summarise(
    mean_weight = mean(animal_weight, na.rm = TRUE),
    sd_weight = sd(animal_weight, na.rm = TRUE),
    n_obs = n()
  )
# A tibble: 2 × 4
  animal_sex mean_weight sd_weight n_obs
  <chr>            <dbl>     <dbl> <int>
1 F                 762.      282.  5285
2 M                 728.      420.  3040

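The output shows 5,285 records for female (F) bison, with a mean weight of approximately 762 pounds, and 3,040 records for male (M) bison, with a mean weight of approximately 728 pounds. Note that n() counts rows, so an animal recorded in more than one year contributes one row per record.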
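The difference between n() and n_distinct() is easiest to see on a small example. The sketch below uses a made-up toy tibble (not the bison data) in which one animal is weighed twice and one weight is missing; the column names simply mirror those of knz_bison.

```r
library(dplyr)

# Hypothetical toy data: animal "a1" is weighed twice, and one weight is missing.
toy <- tibble(
  animal_code   = c("a1", "a1", "a2", "b1", "b2"),
  animal_sex    = c("F",  "F",  "F",  "M",  "M"),
  animal_weight = c(900,  950,  NA,   700,  800)
)

counts <- toy %>%
  group_by(animal_sex) %>%
  summarise(
    n_obs       = n(),                           # rows in the group, NA weights included
    n_weighed   = sum(!is.na(animal_weight)),    # rows with a recorded weight
    n_animals   = n_distinct(animal_code),       # unique animals in the group
    mean_weight = mean(animal_weight, na.rm = TRUE)
  )

counts
# F: n_obs = 3, n_weighed = 2, n_animals = 2, mean_weight = 925
# M: n_obs = 2, n_weighed = 2, n_animals = 2, mean_weight = 750
```

Because na.rm = TRUE silently drops missing values, reporting a count of non-missing values next to n() makes it clear how many observations each mean is actually based on.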

Group by Multiple Variables

We can also group by multiple variables. For instance, let’s calculate the mean weight of bison grouped by both rec_year (recording year) and animal_sex.

# Example: Group by multiple variables ('rec_year', 'animal_sex')
# and calculate mean weight.
bison_data %>%
  group_by(rec_year, animal_sex) %>%
  summarise(mean_weight = mean(animal_weight, na.rm = TRUE))
# A tibble: 52 × 3
# Groups:   rec_year [26]
   rec_year animal_sex mean_weight
      <dbl> <chr>            <dbl>
 1     1994 F                 700.
 2     1994 M                 532.
 3     1995 F                 826.
 4     1995 M                 812.
 5     1996 F                 733.
 6     1996 M                 696.
#  ... (output truncated for brevity, showing first 6 rows, see R script for full output)

This provides a more granular view, showing how mean weight varies by sex within each year. Notice the output indicates # Groups: rec_year [26]. This means that after the summarise() operation, the resulting tibble is still grouped by the first grouping variable, rec_year. This can be important for subsequent operations. We’ll discuss how to manage this next.

Controlling Grouping Structure after summarise()

When you use summarise() on data grouped by multiple variables, it peels off one layer of grouping by default. For example, if you group_by(var1, var2) and then summarise(), the result will be grouped by var1. You can control this behavior with the .groups argument in summarise() or by using ungroup().

  • .groups = "drop_last": Peels off the last grouping variable. This is what summarise() does by default when every group yields a single summary row.
  • .groups = "drop": Drops all grouping levels. Equivalent to calling ungroup() afterwards.
  • .groups = "keep": Keeps the same grouping structure as the input.
  • .groups = "rowwise": Treats each row as a group. (Less common with summarise).
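One way to see what each option does is to inspect the grouping variables that remain after summarise(). The sketch below uses a made-up toy tibble and a small helper, groups_after(), which is our own illustration and not part of dplyr; group_vars() is the dplyr function that reports the current grouping variables.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables.
toy <- tibble(
  rec_year      = c(2000, 2000, 2000, 2001, 2001, 2001),
  animal_sex    = c("F", "F", "M", "F", "M", "M"),
  animal_weight = c(900, 950, 700, 920, 710, 730)
)

grouped <- toy %>% group_by(rec_year, animal_sex)

# Illustrative helper (not a dplyr function): summarise with the given
# .groups option, then report which grouping variables are left.
groups_after <- function(groups_option) {
  grouped %>%
    summarise(mean_weight = mean(animal_weight), .groups = groups_option) %>%
    group_vars()
}

groups_after("drop_last")  # "rec_year"
groups_after("drop")       # character(0)
groups_after("keep")       # "rec_year" "animal_sex"
```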

Using ungroup()

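If you want to perform further operations on the summarized data that should not be group-wise, you can explicitly remove all grouping with ungroup(). In this example, after calculating mean weight per year and sex, we drop the grouping (here with .groups = "drop", which has the same effect as piping to ungroup()) and then compute the mean of those year-by-sex means. Note that this is an unweighted mean of group means, not the mean weight over all individual records.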

# Example: Using ungroup().
# Useful if you want to perform further operations that shouldn't be group-wise.
bison_data %>%
  group_by(rec_year, animal_sex) %>%
  summarise(
    mean_weight = mean(animal_weight, na.rm = TRUE),
    .groups = "drop" # Alternatively, you can use ungroup() after summarise()
  ) %>%
  # Without ungroup() or .groups = "drop", 
  # the mutate() would be applied to each group of rec_year separately
  mutate(
    overall_mean_weight = mean(mean_weight, na.rm = TRUE)
  )
# A tibble: 52 × 4
   rec_year animal_sex mean_weight overall_mean_weight
      <dbl> <chr>            <dbl>               <dbl>
 1     1994 F                 700.                739.
 2     1994 M                 532.                739.
 3     1995 F                 826.                739.
 4     1995 M                 812.                739.
 5     1996 F                 733.                739.
 6     1996 M                 696.                739.
#  ... (output truncated for brevity, see R script for full output)

Here, setting .groups = "drop" in summarise() achieves the same result as piping to ungroup(). Without either, the data would still be grouped by rec_year, so the mutate() would have computed overall_mean_weight as the mean of mean_weight within each rec_year, which is not what we intended here.
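The effect of leftover grouping on mutate() can be reproduced on a handful of rows. The following sketch uses made-up per-year, per-sex means (standing in for the summarised bison data) and compares a mutate() on data still grouped by rec_year with the same mutate() on ungrouped data.

```r
library(dplyr)

# Hypothetical per-year, per-sex means, standing in for the summarised bison data.
yearly_means <- tibble(
  rec_year    = c(2000, 2000, 2001, 2001),
  animal_sex  = c("F", "M", "F", "M"),
  mean_weight = c(900, 700, 1000, 800)
)

# Still grouped by rec_year: mean() is computed separately for each year.
still_grouped <- yearly_means %>%
  group_by(rec_year) %>%
  mutate(overall_mean_weight = mean(mean_weight))

# No grouping: mean() runs over all four rows.
ungrouped <- yearly_means %>%
  mutate(overall_mean_weight = mean(mean_weight))

still_grouped$overall_mean_weight  # 800 800 900 900
ungrouped$overall_mean_weight      # 850 850 850 850
```

The two results have the same shape and column names, which is why forgotten grouping is easy to miss; checking group_vars() before a mutate() is a cheap safeguard.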

Multiple Summary Statistics & Columns with summarise() and across()

The across() function is a powerful tool to apply the same transformation or summary function(s) to multiple columns. It’s often used within summarise() or mutate().

Using across() to Apply One Summary Function to Multiple Columns

Let’s calculate the mean of both animal_weight and our newly created approx_age, grouped by animal_sex.

A key feature of across() is the .names argument, which allows you to control the names of the output columns. It uses a template string where:

  • {.col} refers to the original column name being processed.
  • {.fn} refers to the name of the function being applied (useful when applying multiple functions, see next example).

In this example, .names = "mean_{.col}" will create columns like mean_animal_weight and mean_approx_age.

# Example: Using across() to apply one summary function to multiple columns.
# Calculate mean of 'animal_weight' and 'approx_age' by 'animal_sex'.
bison_data %>%
  group_by(animal_sex) %>%
  summarise(across(
    .cols = c(animal_weight, approx_age),
    # Wrap mean() in a lambda so that na.rm = TRUE is passed explicitly;
    # supplying extra arguments through across()'s ... is deprecated as of dplyr 1.1.0
    .fns = ~ mean(., na.rm = TRUE),
    # .names controls output column names
    .names = "mean_{.col}"
  ))
# A tibble: 2 × 3
  animal_sex mean_animal_weight mean_approx_age
  <chr>                   <dbl>           <dbl>
1 F                        762.            4.55
2 M                        728.            1.65

Using across() to Apply Multiple Summary Functions to Multiple Columns

You can also apply a list of functions with across(). Here, we calculate both the mean (named avg) and standard deviation (named stdev) for animal_weight and approx_age, grouped by animal_sex. We use the purrr-style lambda function syntax (e.g., ~ mean(., na.rm = TRUE)), where . represents the column data being passed to the function.

The .names argument now uses both {.col} for the original column name and {.fn} for the name of the function from our list (avg or stdev). So, .names = "{.col}_{.fn}" will generate names like animal_weight_avg, animal_weight_stdev, etc.

# Example: Using across() to apply multiple summary functions to multiple columns.
# Calculate mean and sd for 'animal_weight' and 'approx_age' by 'animal_sex'.
bison_data %>%
  group_by(animal_sex) %>%
  summarise(across(
    .cols = c(animal_weight, approx_age),
    .fns = list(
      avg = ~ mean(., na.rm = TRUE),
      stdev = ~ sd(., na.rm = TRUE)
    ),
    # .names controls output column names
    # .fn is the function name from the list (avg, stdev)
    # .col is the original column name
    .names = "{.col}_{.fn}"
  ))
# A tibble: 2 × 5
  animal_sex animal_weight_avg animal_weight_stdev approx_age_avg approx_age_stdev
  <chr>                  <dbl>               <dbl>          <dbl>            <dbl>
1 F                       762.                282.           4.55             4.52
2 M                       728.                420.           1.65             1.88
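The purrr-style formula is one of several ways to hand across() a function with extra arguments. As a sketch on made-up toy data, the three calls below use a formula lambda, R's shorthand anonymous function (which assumes R 4.1 or later), and a regular function(x); all three should give the same summary.

```r
library(dplyr)

# Hypothetical toy data with one missing value in each numeric column.
toy <- tibble(
  animal_sex    = c("F", "F", "M", "M"),
  animal_weight = c(900, NA, 700, 800),
  approx_age    = c(5, 7, NA, 2)
)

by_sex <- toy %>% group_by(animal_sex)

# 1. purrr-style formula: the column is available as . (or .x)
res_formula <- by_sex %>%
  summarise(across(c(animal_weight, approx_age),
                   ~ mean(.x, na.rm = TRUE), .names = "{.col}_avg"))

# 2. Shorthand anonymous function (R >= 4.1)
res_lambda <- by_sex %>%
  summarise(across(c(animal_weight, approx_age),
                   \(x) mean(x, na.rm = TRUE), .names = "{.col}_avg"))

# 3. Regular anonymous function
res_function <- by_sex %>%
  summarise(across(c(animal_weight, approx_age),
                   function(x) mean(x, na.rm = TRUE), .names = "{.col}_avg"))

res_formula
# F: animal_weight_avg = 900, approx_age_avg = 6
# M: animal_weight_avg = 750, approx_age_avg = 2
```

Which form to use is largely a matter of taste; the anonymous-function forms make the argument name explicit, which can be easier to read for newcomers.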

Advanced Summaries

Proportional Composition

Grouping and summarizing can be chained for more complex analyses. Let’s tackle a common scenario: calculating the proportional composition of subgroups. Our goal is to find, for each recording year, the percentage of male and female bison, such that within each year, these percentages sum to 100%.

The strategy involves several steps:

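  • Filter out any rows where approx_age or animal_sex is missing. Only animal_sex enters the percentage calculation itself; the approx_age filter restricts the analysis to records with a known year of birth.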
  • Group by rec_year and animal_sex.
  • Count the number of bison in each group (count_in_year = n()). We use .groups = "drop" here to remove all grouping after this initial summary, so that the grouping used in the next step is stated explicitly. (Alternatively, .groups = "drop_last" would leave the result grouped by rec_year, making the explicit re-grouping step unnecessary.)
  • Re-group by just rec_year.
  • Within each year, calculate the total number of bison for that year (total_for_year = sum(count_in_year)).
  • Calculate the percentage for each sex within that year: (count_in_year / total_for_year) * 100.
  • ungroup() for a clean final tibble and arrange() for readability.

# Example: Calculating Proportional Composition within Subgroups
# Goal: For each recording year, calculate the male and female percentage of bison for that year
# Therefore each year's percentages should sum to 100%

bison_sex_pct_by_year <- bison_data %>%
  filter(!is.na(approx_age) & !is.na(animal_sex)) %>% # Ensure age and sex are available
  group_by(rec_year, animal_sex) %>%
  summarise(count_in_year = n(), .groups = "drop") %>% # Count per sex per year, then drop all groups
  group_by(rec_year) %>% # Re-group by year to sum counts within each year
  mutate(
    total_for_year = sum(count_in_year),
    percentage_of_sex_in_year = (count_in_year / total_for_year) * 100
  ) %>%
  ungroup() %>% # Remove grouping
  arrange(rec_year, animal_sex) # Order for easier viewing

bison_sex_pct_by_year
# A tibble: 52 × 5
   rec_year animal_sex count_in_year total_for_year percentage_of_sex_in_year
      <dbl> <chr>              <int>          <int>                     <dbl>
 1     1994 F                     78            144                      54.2
 2     1994 M                     66            144                      45.8
 3     1995 F                     71            136                      52.2
 4     1995 M                     65            136                      47.8
 5     1996 F                    114            193                      59.1
 6     1996 M                     79            193                      40.9
#  ... (output truncated for brevity, see R script for full output)
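The .groups = "drop_last" alternative mentioned in the strategy can be sketched as follows. This is a variant on made-up toy data, not the code used to produce the output above: because the summary stays grouped by rec_year, the mutate() can follow directly without a second group_by().

```r
library(dplyr)

# Hypothetical toy data: one row per recorded animal.
toy <- tibble(
  rec_year   = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  animal_sex = c("F", "F", "M", "F", "M", "M", "M")
)

sex_pct <- toy %>%
  group_by(rec_year, animal_sex) %>%
  # "drop_last" leaves the result grouped by rec_year only
  summarise(count_in_year = n(), .groups = "drop_last") %>%
  # so sum() here already runs within each year
  mutate(
    total_for_year = sum(count_in_year),
    percentage_of_sex_in_year = count_in_year / total_for_year * 100
  ) %>%
  ungroup()

sex_pct
# 2000: F 2 of 3 (about 66.7%), M 1 of 3 (about 33.3%)
# 2001: F 1 of 4 (25%),         M 3 of 4 (75%)
```

The explicit "drop" plus group_by(rec_year) version is longer but spells out the grouping at each step; the "drop_last" version is shorter but relies on the reader knowing which grouping variable remains.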

To verify our calculation, we can group the resulting bison_sex_pct_by_year data frame by rec_year and sum the percentage_of_sex_in_year. Each year should sum to 100%.

# Verification: The total percentage for each year should be 100
bison_sex_pct_by_year %>%
  group_by(rec_year) %>%
  summarise(total_percentage = sum(percentage_of_sex_in_year), .groups = "drop")
# A tibble: 26 × 2
   rec_year total_percentage
      <dbl>            <dbl>
 1     1994              100
 2     1995              100
 3     1996              100
 4     1997              100
 5     1998              100
 6     1999              100
#  ... (output truncated for brevity, see R script for full output)

Each year’s percentages sum to 100, as expected. Sums of floating-point percentages can deviate from exactly 100 by a tiny amount, although no such deviation is visible in the printed output here.
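If the check needs to be automated rather than read off the printed output, comparing with a tolerance is safer than testing for exact equality. The sketch below uses made-up percentages and dplyr's near(), which compares two numeric vectors within a small tolerance; the column names group and percentage are illustrative only.

```r
library(dplyr)

# Hypothetical percentages; thirds cannot be represented exactly in floating point.
pct <- tibble(
  rec_year   = c(2000, 2000, 2000, 2001, 2001),
  group      = c("a", "b", "c", "a", "b"),
  percentage = c(100 / 3, 100 / 3, 100 / 3, 40, 60)
)

check <- pct %>%
  group_by(rec_year) %>%
  summarise(total_percentage = sum(percentage), .groups = "drop") %>%
  # near() tolerates tiny rounding differences that == would not
  mutate(sums_to_100 = near(total_percentage, 100))

all(check$sums_to_100)  # TRUE
```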

Conclusion

The group_by() and summarise() functions, especially when combined with across(), provide a flexible and powerful framework for data aggregation and summary in R. By understanding how to group by single or multiple variables, control grouping structures, and apply various summary functions, you can efficiently derive meaningful insights from your datasets. Remember to consider how NA values are handled (e.g., using na.rm = TRUE) and how the grouping structure persists or is modified after summarization to ensure your analyses are accurate and lead to the intended results.