Parts of a Whole Comparisons

In ecological studies, we often need to visualize how different components contribute to a whole. These “parts of a whole” comparisons are essential for understanding community composition, species distributions, and ecosystem structure. This article explores common visualization techniques for these types of data, including treemaps, stacked barplots, and grouped barplots. We’ll demonstrate these techniques using a microbiome dataset from the human gut.

Introduction to Parts of a Whole Visualizations

When studying ecological communities, we often need to visualize how different species or taxonomic groups contribute to the overall community. These “parts of a whole” comparisons are fundamental in ecology for understanding:

  • Community composition and structure
  • Species diversity and evenness
  • Changes in community composition across environmental gradients
  • Temporal changes in community structure

Several visualization techniques are commonly used for these comparisons:

  • Treemaps: Hierarchical visualizations that show proportions through nested rectangles
  • Stacked barplots: Show the composition of different groups across categories
  • Grouped barplots: Compare the same components across different groups
  • Dendrograms: Show hierarchical relationships between samples based on their composition (not covered in this article)

In this article, we’ll explore these visualization techniques using a microbiome dataset from the human gut. We’ll focus on treemaps, stacked barplots, and grouped barplots, which are particularly useful for visualizing taxonomic composition data.

The Dataset: Human Gut Microbiome

We’ll be using the Baxter colorectal cancer (CRC) dataset (Baxter et al., 2016), which includes stool samples from patients with different disease statuses. This dataset is particularly interesting because it allows us to explore how the gut microbiome composition varies across different disease states.

The dataset consists of three main components that we’ll join together:

  • Metadata: Information about each sample, including the disease status of the patient
  • OTU Counts: The abundance of each Operational Taxonomic Unit (OTU) in each sample
  • Taxonomic Information: The taxonomic classification of each OTU

Understanding Microbiome Data

Before diving into the visualizations, let’s understand some key concepts in microbiome research:

  • Operational Taxonomic Units (OTUs): These are clusters of similar DNA sequences that are used as proxies for species in microbiome studies. OTUs are typically defined by a similarity threshold (often 97%) in their DNA sequences.
  • Taxonomic Ranks: Biological classification follows a hierarchical structure: Kingdom → Phylum → Class → Order → Family → Genus → Species. In microbiome studies, we often focus on higher taxonomic ranks (like phylum) because species-level identification can be challenging.
  • Relative Abundance: The proportion of each OTU or taxonomic group relative to the total community. This is often expressed as a percentage and is useful for comparing communities with different total abundances.
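
As a minimal sketch of the relative abundance calculation, assume a single made-up sample with three hypothetical taxa. The names `toy_counts` and `toy_rel_abund` and the counts themselves are illustrative and are not taken from the Baxter dataset:

```r
library(dplyr)

# Hypothetical counts for a single sample (illustrative values only)
toy_counts <- tibble(
  taxon = c("Taxon A", "Taxon B", "Taxon C"),
  count = c(60, 30, 10)
)

# Relative abundance = count of a taxon / total count in the sample
toy_rel_abund <- toy_counts %>%
  mutate(
    rel_abund = count / sum(count),
    rel_abund_pct = 100 * rel_abund
  )

toy_rel_abund
```

Because every value is divided by the sample total, the proportions of a sample always sum to 1, whatever the sequencing depth of that sample.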

Data Preprocessing – Loading, Cleaning, and Joining Data

Let’s load and examine each component of our dataset. Since this is a tutorial on plotting, we will not go into much detail on the cleaning steps performed here. You are welcome to start with the raw dataset yourself and see how each step affects the outcome.

In this first step, we download the metadata and keep only the sample identifier and the disease status of each patient.

# Load the packages used throughout this article
library(tidyverse)

# Load the metadata
df_metadata <- read_tsv(
  "https://raw.githubusercontent.com/riffomonas/minimalR-raw_data/refs/heads/master/baxter.metadata.tsv",
  col_types = cols(
    sample = col_character(),
    Dx_Bin = col_character()
  )
) %>%
  select(sample, Dx_Bin) %>%
  rename(disease_status = Dx_Bin) %>%
  drop_na(disease_status) %>% # Drop rows with NA disease status
  mutate(
    disease_status = fct_relevel(disease_status, c("Normal", "High Risk Normal", "Adenoma", "Adv Adenoma", "Cancer"))
  )

# Assess the counts of disease status
df_metadata %>%
  group_by(disease_status) %>%
  count()
# A tibble: 5 × 2
# Groups:   disease_status [5]
  disease_status       n
  <fct>            <int>
1 Normal             122
2 High Risk Normal    50
3 Adenoma             89
4 Adv Adenoma        109
5 Cancer             120

We can see that patients can be classified into 5 different disease statuses: Normal, High Risk Normal, Adenoma, Adv Adenoma, and Cancer. Next, we will load the OTU count information and pivot it into tidy (long) format.

# Load the OTU counts
df_otu <- read_tsv(
  "https://raw.githubusercontent.com/riffomonas/minimalR-raw_data/refs/heads/master/baxter.subsample.shared",
  col_types = cols(
    Group = col_character(),
    .default = col_double()
  )
) %>%
  select(Group, starts_with("Otu")) %>%
  rename(sample = Group) %>% # This will be used to join with the metadata
  pivot_longer(cols = starts_with("Otu"), names_to = "otu", values_to = "count")

df_otu %>% head()
# A tibble: 6 × 3
  sample  otu       count
  <chr>   <chr>     <dbl>
1 2003650 Otu000001   346
2 2003650 Otu000002   267
3 2003650 Otu000003   289
4 2003650 Otu000004   243
5 2003650 Otu000005   263
6 2003650 Otu000006   681

The OTU counts data shows the raw abundance of each OTU in each sample. For example, in sample 2003650, Otu000001 has a count of 346, Otu000002 has a count of 267, and so on. We’ve used `pivot_longer()` to convert the data from wide format (where each OTU is a column) to long format (where each row represents a sample-OTU combination), which is more suitable for many visualization and analysis tasks.

In database terminology, we can think of this as an association table, as it contains both the identifier of the patient/sample and the OTU identifier. This will ultimately allow us to combine all three dataframes together. But before that, we still need to get the taxonomic information:

# Load the taxa information
df_taxa <- read_tsv(
  "https://raw.githubusercontent.com/riffomonas/minimalR-raw_data/refs/heads/master/baxter.cons.taxonomy"
) %>%
  select(OTU, Taxonomy) %>%
  rename(
    otu = OTU, # This will be used to join with the OTU counts
    taxonomy = Taxonomy
  ) %>%
  mutate(
    taxonomy = str_replace_all(taxonomy, "\\(\\d+\\)", ""), # Remove bracketed numbers
    taxonomy = str_replace_all(taxonomy, "unclassified", "NA"), # Replace "unclassified" with the string "NA" (converted to a real NA below)
    taxonomy = str_replace_all(taxonomy, ";$", "") # Remove trailing semicolon
  ) %>%
  # Separate the taxonomy into separate columns via the ";" delimiter
  separate(
    taxonomy,
    into = c("kingdom", "phylum", "class", "order", "family", "genus"),
    sep = ";",
    remove = TRUE, # Remove the original taxonomy column
    convert = TRUE # Convert the string "NA" to a real NA
  )

df_taxa %>% head()
# A tibble: 6 × 7
  otu       kingdom  phylum          class            order              family              genus           
  <chr>     <chr>    <chr>           <chr>            <chr>              <chr>               <chr>           
1 Otu000001 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Blautia         
2 Otu000002 Bacteria Bacteroidetes   Bacteroidia      Bacteroidales      Bacteroidaceae      Bacteroides     
3 Otu000003 Bacteria Bacteroidetes   Bacteroidia      Bacteroidales      Bacteroidaceae      Bacteroides     
4 Otu000004 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae Akkermansia     
5 Otu000005 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Roseburia       
6 Otu000006 Bacteria Firmicutes      Clostridia       Clostridiales      Ruminococcaceae     Faecalibacterium

A lot of data cleaning was done here. Once again, you are welcome to view the raw data yourself to get a better understanding of what was done. Note that the `tidyverse` offers a very useful function for such cases called `separate()`, which splits a column into multiple columns based on a delimiter. Here, we split the taxonomy string into one column per taxonomic rank, which makes it easier to analyze and visualize the data at different taxonomic levels.
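
To see `separate()` in isolation, here is a small sketch on two made-up taxonomy strings. The table `toy_taxa` and its OTU names are hypothetical; only the function and its arguments mirror the code above:

```r
library(dplyr)
library(tidyr)

# Hypothetical taxonomy strings, already cleaned of bracketed numbers
toy_taxa <- tibble(
  otu = c("OtuA", "OtuB"),
  taxonomy = c("Bacteria;Firmicutes;Clostridia", "Bacteria;Bacteroidetes;NA")
)

toy_split <- toy_taxa %>%
  separate(
    taxonomy,
    into = c("kingdom", "phylum", "class"),
    sep = ";",
    convert = TRUE # the string "NA" becomes a real missing value
  )

toy_split
```

In recent versions of `tidyr`, `separate()` is marked as superseded in favor of `separate_wider_delim()`. It still works, but the newer function may be worth a look for new code.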

Ok, now that we have the three dataframes, we can join them together to create a single comprehensive dataset:

# We can now join the metadata, OTU counts, and taxa information
df_combined <- df_otu %>%
  inner_join(df_taxa, by = "otu") %>%
  inner_join(df_metadata, by = "sample")

df_combined
# A tibble: 2,683,240 × 10
   sample  otu       count kingdom  phylum          class            order              family              genus            disease_status  
   <chr>   <chr>     <dbl> <chr>    <chr>           <chr>            <chr>              <chr>               <chr>            <fct>           
 1 2003650 Otu000001   346 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Blautia          High Risk Normal
 2 2003650 Otu000002   267 Bacteria Bacteroidetes   Bacteroidia      Bacteroidales      Bacteroidaceae      Bacteroides      High Risk Normal
 3 2003650 Otu000003   289 Bacteria Bacteroidetes   Bacteroidia      Bacteroidales      Bacteroidaceae      Bacteroides      High Risk Normal
 4 2003650 Otu000004   243 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae Akkermansia      High Risk Normal
 5 2003650 Otu000005   263 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Roseburia        High Risk Normal
 6 2003650 Otu000006   681 Bacteria Firmicutes      Clostridia       Clostridiales      Ruminococcaceae     Faecalibacterium High Risk Normal
 7 2003650 Otu000007   244 Bacteria Bacteroidetes   Bacteroidia      Bacteroidales      Bacteroidaceae      Bacteroides      High Risk Normal
 8 2003650 Otu000008    88 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Anaerostipes     High Risk Normal
 9 2003650 Otu000009   108 Bacteria Firmicutes      Clostridia       Clostridiales      Lachnospiraceae     Blautia          High Risk Normal
10 2003650 Otu000010   253 Bacteria Firmicutes      Clostridia       Clostridiales      NA                  NA               High Risk Normal

The combined dataset now contains all the information we need: the sample ID, OTU ID, count, taxonomic classification, and disease status. This allows us to analyze and visualize the microbiome composition across different disease statuses.
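
One thing to keep in mind is that `inner_join()` silently drops rows whose key is missing from either table. The sketch below uses two made-up tables (`toy_otu` and `toy_meta`, both hypothetical) to show the effect, and how `anti_join()` can reveal what was dropped:

```r
library(dplyr)

# Hypothetical OTU counts for three samples
toy_otu <- tibble(
  sample = c("S1", "S1", "S2", "S3"),
  otu = c("OtuA", "OtuB", "OtuA", "OtuA"),
  count = c(5, 3, 7, 2)
)

# Hypothetical metadata that lacks sample S3
toy_meta <- tibble(
  sample = c("S1", "S2"),
  disease_status = c("Normal", "Cancer")
)

# inner_join() keeps only rows whose key appears in both tables
toy_joined <- inner_join(toy_otu, toy_meta, by = "sample")

# anti_join() lists the rows that the inner join dropped
toy_dropped <- anti_join(toy_otu, toy_meta, by = "sample")

toy_joined
toy_dropped
```

Running a check like this on the real tables is a quick way to confirm that no samples or OTUs were lost unintentionally during the join.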

Data Preprocessing – Calculating Relative Abundance

Before creating our visualizations, we need to preprocess the data. For microbiome studies, a common preprocessing step is to calculate the relative abundance of each taxonomic group within each sample. This allows us to compare samples with different total abundances.

Let’s calculate the relative abundance of each phylum for each sample:

# For most of the plots, we will be working with the relative abundance of each phylum per disease status.
df_rel_abund_phylum <- df_combined %>%
  select(disease_status, sample, phylum, count) %>%
  drop_na(phylum) %>%
  # Calculate the relative abundance of each phylum per sample
  # First, sum the OTU counts within each sample and phylum
  group_by(disease_status, sample, phylum) %>%
  summarise(count = sum(count)) %>%
  ungroup() %>%
  # Then, by calculating the relative abundance
  group_by(sample) %>%
  mutate(rel_abund = count / sum(count)) %>%
  ungroup()

df_rel_abund_phylum
# A tibble: 8,330 × 5
   disease_status sample  phylum                      count rel_abund
   <fct>          <chr>   <chr>                       <dbl>     <dbl>
 1 Normal         2013660 Acidobacteria                   0    0     
 2 Normal         2013660 Actinobacteria                908    0.0864
 3 Normal         2013660 Bacteroidetes                1162    0.111 
 4 Normal         2013660 Candidatus_Saccharibacteria     0    0     
 5 Normal         2013660 Deferribacteres                 0    0     
 6 Normal         2013660 Deinococcus-Thermus             0    0     
 7 Normal         2013660 Elusimicrobia                   0    0     
 8 Normal         2013660 Firmicutes                   7679    0.731 
 9 Normal         2013660 Fusobacteria                    0    0     
10 Normal         2013660 Lentisphaerae                   0    0     

For some visualizations, we’ll also pool rare phyla (those whose mean relative abundance within a disease status is below 5%) into an “Other” category to simplify the visualization:

# We will also pool categories with less than 5% relative abundance into "Other".
df_rel_abund_phylum_pooled <- df_rel_abund_phylum %>%
  # Calculate the mean relative abundance of each phylum per disease status
  group_by(disease_status, phylum) %>%
  summarise(mean_rel_abund = 100 * mean(rel_abund)) %>%
  # Pool categories with less than 5% relative abundance into "Other"
  mutate(phylum = ifelse(mean_rel_abund < 5, "Other", phylum)) %>%
  # To account for the pooling, we will sum the mean relative abundances
  ungroup() %>%
  group_by(disease_status, phylum) %>%
  summarise(mean_rel_abund = sum(mean_rel_abund), .groups = "drop")

df_rel_abund_phylum_pooled
# A tibble: 15 × 3
   disease_status   phylum        mean_rel_abund
   <fct>            <chr>                  <dbl>
 1 Normal           Bacteroidetes          23.6 
 2 Normal           Firmicutes             66.8 
 3 Normal           Other                   9.59
 4 High Risk Normal Bacteroidetes          27.5 
 5 High Risk Normal Firmicutes             63.5 
 6 High Risk Normal Other                   8.98
 7 Adenoma          Bacteroidetes          29.8 
 8 Adenoma          Firmicutes             62.3 
 9 Adenoma          Other                   7.85
10 Adv Adenoma      Bacteroidetes          23.4 
11 Adv Adenoma      Firmicutes             66.0 
12 Adv Adenoma      Other                  10.6 
13 Cancer           Bacteroidetes          24.7 
14 Cancer           Firmicutes             64.0 
15 Cancer           Other                  11.4 

Note the use of the `.groups = "drop"` argument in the `summarise()` function. This argument controls what happens to the grouping variables after the summarization. Setting it to `"drop"` means that all grouping variables are removed, resulting in an ungrouped data frame. This is useful when you want to perform further operations on the data without the grouping structure.
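
The difference is easiest to see on a tiny example. The sketch below uses a made-up table (`toy`, hypothetical values) and compares `.groups = "drop_last"`, which is what `summarise()` does by default when each group yields one row, with `.groups = "drop"`:

```r
library(dplyr)

# Hypothetical values for two disease statuses and two phyla
toy <- tibble(
  disease_status = c("Normal", "Normal", "Cancer", "Cancer"),
  phylum = c("Firmicutes", "Other", "Firmicutes", "Other"),
  value = c(70, 30, 60, 40)
)

# Only the last grouping variable (phylum) is dropped
still_grouped <- toy %>%
  group_by(disease_status, phylum) %>%
  summarise(value = sum(value), .groups = "drop_last")

# All grouping variables are dropped
ungrouped <- toy %>%
  group_by(disease_status, phylum) %>%
  summarise(value = sum(value), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

The first result is still grouped by `disease_status`, so a later `mutate()` or `summarise()` would operate within each disease status. The second result has no groups left.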

Treemaps

Treemaps are hierarchical visualizations that display proportions through nested rectangles. They are particularly useful for visualizing hierarchical data where you want to show both the overall structure and the relative sizes of components.

In ecology, treemaps are often used to visualize:

  • Taxonomic composition of communities
  • Biomass distribution across species
  • Energy flow through food webs
  • Habitat composition

Let’s create a treemap to visualize the relative abundance of bacterial phyla across different disease statuses:

treemap::treemap(df_rel_abund_phylum,
  index = c("disease_status", "phylum"),
  vSize = "rel_abund",
  type = "index",
  algorithm = "pivotSize",
  title = "Relative Abundance of Bacterial Phyla per Disease Status",
  fontsize.title = 24,
  fontsize.labels = 14
)

In this treemap:

  • The outer rectangles represent the different disease statuses
  • The inner rectangles represent the different bacterial phyla
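  • The size of each inner rectangle is proportional to the relative abundance of that phylum summed over all samples with that disease status, so the outer rectangles also reflect how many samples each group contains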

From this visualization, we can see that:

  • Firmicutes and Bacteroidetes are the dominant phyla across all disease statuses
  • There are some differences in the relative abundances of these phyla across disease statuses
  • Most other phyla have very low relative abundances, making them difficult to see in the treemap

This is why we have pooled rarer phyla into an “Other” category for our subsequent visualizations.
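
As a complementary sketch, the pooled means can also be drawn as a treemap. The values below are copied from the pooled table printed earlier. The object names `toy_pooled` and `status_totals` are ours, and we assume the `treemap` package is installed (the plot is skipped otherwise). Because each disease status sums to roughly 100%, the outer rectangles should come out nearly equal in size, which makes the inner proportions easier to compare:

```r
library(dplyr)

# Pooled mean relative abundances (%), copied from the table printed earlier
toy_pooled <- tibble(
  disease_status = rep(
    c("Normal", "High Risk Normal", "Adenoma", "Adv Adenoma", "Cancer"),
    each = 3
  ),
  phylum = rep(c("Bacteroidetes", "Firmicutes", "Other"), times = 5),
  mean_rel_abund = c(
    23.6, 66.8, 9.59,
    27.5, 63.5, 8.98,
    29.8, 62.3, 7.85,
    23.4, 66.0, 10.6,
    24.7, 64.0, 11.4
  )
)

# Each disease status sums to roughly 100%
status_totals <- toy_pooled %>%
  group_by(disease_status) %>%
  summarise(total = sum(mean_rel_abund), .groups = "drop")

# Draw the treemap only if the package is available
if (requireNamespace("treemap", quietly = TRUE)) {
  treemap::treemap(
    as.data.frame(toy_pooled),
    index = c("disease_status", "phylum"),
    vSize = "mean_rel_abund",
    type = "index",
    title = "Pooled Mean Relative Abundance of Bacterial Phyla per Disease Status"
  )
}
```

With only three categories per disease status, every rectangle is large enough to label, at the cost of hiding the individual rare phyla.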

Barplots

Barplots are versatile visualizations that can be used to compare quantities across categories. For parts of a whole comparisons, two common types of barplots are stacked barplots and grouped barplots.

Stacked Barplots

Stacked barplots show the composition of different groups across categories. They are particularly useful when you want to show the proportion of different components within each category, especially when proportions add up to 100%.

In ecology, stacked barplots are commonly used to visualize:

  • Species composition across different sites or time points
  • Taxonomic composition across different samples or treatments
  • Functional group composition across different ecosystems

Let’s create a stacked barplot to visualize the mean relative abundance of bacterial phyla across different disease statuses. In `ggplot2`, we use the `geom_bar()` function to create a stacked barplot. Within this function, we specify the `stat = "identity"` argument so that the bar heights are the actual mean relative abundances mapped to `y`. Without it, `geom_bar()` falls back to its default `stat = "count"`, which counts the rows in each category instead of using our values, which is not what we want. The `position` argument of `geom_bar()` controls whether the bars are stacked or grouped. Since the default is to stack the bars, we do not need to specify it:

df_rel_abund_phylum_pooled %>%
  ggplot(aes(x = disease_status, y = mean_rel_abund, fill = phylum)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Relative Abundance of Bacterial Phyla per Disease Status",
    x = "Disease Status",
    y = "Mean Relative Abundance (%)",
    fill = "Phylum"
  ) +
  scale_fill_manual(
    values = c(
      "Firmicutes" = "blue",
      "Bacteroidetes" = "red",
      "Other" = "grey"
    )
  ) +
  ggthemes::theme_base()
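
Two variations on the stacked barplot are worth knowing. The sketch below uses a small made-up table (`toy`, hypothetical values) rather than the real data. It shows that `geom_col()` is a shorthand for `geom_bar(stat = "identity")`, and that `position = "fill"` rescales every bar to the same total height, which is handy when the parts do not already sum to 100%:

```r
library(ggplot2)

# Hypothetical mean relative abundances (illustrative values only)
toy <- data.frame(
  disease_status = c("Normal", "Normal", "Cancer", "Cancer"),
  phylum = c("Firmicutes", "Other", "Firmicutes", "Other"),
  mean_rel_abund = c(70, 30, 60, 40)
)

base <- ggplot(toy, aes(x = disease_status, y = mean_rel_abund, fill = phylum))

# geom_bar(stat = "identity") and geom_col() draw the same stacked bars
p_bar <- base + geom_bar(stat = "identity")
p_col <- base + geom_col(position = "stack")

# position = "fill" rescales every bar to a total height of 1
p_fill <- base + geom_col(position = "fill")

p_fill
```

When using `position = "fill"`, the y-axis shows proportions between 0 and 1, so the axis label should be adjusted accordingly.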

Grouped Barplots

Grouped barplots are meant for more direct comparisons of counts across different groups, without particular emphasis on showcasing proportions adding up to 100%.

In ecology, grouped barplots are commonly used to visualize the abundance of different species or groups across different sites or conditions.

SUPER IMPORTANT: Never use a grouped barplot as a substitute for boxplots. Boxplots are the better choice for showing the distribution of a continuous outcome variable and comparing it across different groups. Grouped barplots are better suited to count data.

Let’s create a grouped barplot to compare the mean relative abundance of bacterial phyla across different disease statuses. In `ggplot2`, we can use the `position = "dodge"` argument to create a grouped barplot:

df_rel_abund_phylum_pooled %>%
  ggplot(aes(x = phylum, y = mean_rel_abund, fill = disease_status)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  scale_fill_manual(
    values = c(
      "Normal" = "blue",
      "High Risk Normal" = "green",
      "Adenoma" = "yellow",
      "Adv Adenoma" = "orange",
      "Cancer" = "red"
    )
  ) +
  labs(
    title = "Mean Relative Abundance of Bacterial Phyla per Disease Status",
    x = "Phylum",
    y = "Mean Relative Abundance (%)",
    fill = "Disease Status"
  ) +
  ggthemes::theme_base()
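
To illustrate the earlier caution about boxplots, here is a minimal sketch of how per-sample values could be shown instead of group means. The data are simulated purely for illustration (`toy_samples` is hypothetical and does not come from the Baxter dataset):

```r
library(ggplot2)

# Simulated per-sample relative abundances of one phylum (not real data)
set.seed(1)
toy_samples <- data.frame(
  disease_status = rep(c("Normal", "Cancer"), each = 20),
  rel_abund = c(rbeta(20, 6, 3), rbeta(20, 5, 4))
)

# One box per group shows the median, quartiles and outliers,
# which a bar of group means would hide
p_box <- ggplot(toy_samples, aes(x = disease_status, y = 100 * rel_abund)) +
  geom_boxplot() +
  labs(x = "Disease Status", y = "Relative Abundance of One Phylum (%)")

p_box
```

A bar of means reduces each group to a single number, whereas the boxplot keeps the spread between samples visible, which matters when judging whether groups really differ.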

Conclusion

In this article, we’ve explored three common visualization techniques for parts of a whole comparisons in ecology: treemaps, stacked barplots, and grouped barplots. We’ve demonstrated these techniques using a microbiome dataset from the human gut, showing how the composition of bacterial phyla varies across different disease statuses.

Each visualization technique has its strengths:

  • Treemaps are excellent for visualizing hierarchical data and showing proportions through nested rectangles.
  • Stacked barplots are great for showing the proportions of different components within each category, especially when proportions add up to 100%.
  • Grouped barplots are ideal for comparing the same components across different groups.

There are many other visualization techniques that can be used for parts of a whole comparisons in ecology, including:

  • Heatmaps: Show the abundance of components across samples using color intensity
  • Scatter plots: Show the relationship between different components
  • Dendrograms: Show hierarchical relationships between samples based on their composition

The choice of visualization technique depends on the specific question you’re trying to answer and the characteristics of your data. By using a combination of these techniques, you can gain a comprehensive understanding of the composition and structure of ecological communities.