Ecological Diversity, Rarefaction, and Dissimilarity

When studying ecological communities, we often need to answer questions like: How diverse is this community? How does it compare to other communities? And how do we account for differences in sampling effort? This article introduces the `vegan` package, a powerful tool for answering these questions through ecological diversity analysis, rarefaction, and dissimilarity measures.

Introduction

Understanding ecological communities requires us to quantify and compare their diversity. Whether you’re studying microbial communities in the gut, plant communities in a forest, or animal communities in a coral reef, you’ll need tools to measure diversity, account for sampling effort, and compare communities. The `vegan` package in R provides a comprehensive suite of functions for these analyses.

The vegan package is one of the most widely used packages in ecological data analysis. It provides functions for diversity analysis, ordination, and hypothesis testing. In this article, we’ll focus on three key aspects: measuring diversity, rarefaction (to account for sampling effort), and calculating dissimilarities between communities.

We’ll be working with a dataset from Väre, H., Ohtonen, R. and Oksanen, J. (1995) that examines the effects of reindeer grazing on understorey vegetation in dry Pinus sylvestris forests. This dataset is particularly interesting because it allows us to explore how grazing pressure affects plant community composition and diversity.

The vegan package expects data to be organized in two main components:

community data (species counts or coverages)
environmental data (metadata about the samples).

Let’s first load in the species data:

data(varespec) # Species data
df_varespec <- varespec %>%
  as_tibble()
str(df_varespec)

tibble [24 × 44] (S3: tbl_df/tbl/data.frame)
 $ Callvulg: num [1:24] 0.55 0.67 0.1 0 0 ...
 $ Empenigr: num [1:24] 11.13 0.17 1.55 15.13 12.68 ...
 $ Rhodtome: num [1:24] 0 0 0 2.42 0 0 1.55 0 0.35 0.07 ...
 $ Vaccmyrt: num [1:24] 0 0.35 0 5.92 0 ...
 $ Vaccviti: num [1:24] 17.8 12.1 13.5 16 23.7 ...
 $ Pinusylv: num [1:24] 0.07 0.12 0.25 0 0.03 0.12 0.1 0.1 0.05 0.12 ...
 $ Descflex: num [1:24] 0 0 0 3.7 0 0.02 0.78 0 0.4 0 ...
 $ Betupube: num [1:24] 0 0 0 0 0 0 0.02 0 0 0 ...
 $ Vacculig: num [1:24] 1.6 0 0 1.12 0 0 2 0 0.2 0 ...
 $ Diphcomp: num [1:24] 2.07 0 0 0 0 0 0 0 0 0.07 ...

Followed by the environmental data:

data(varechem) # Environmental data
samples_grazed <- c(2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 24)
samples_ungrazed <- c(1, 6, 10, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
df_varechem <- varechem %>%
  as_tibble() %>%
  # Add grazing treatment as a factor
  mutate(
    grazing = factor(
      c(
        samples_grazed,
        samples_ungrazed
      ),
      labels = c(
        rep("grazed", length(samples_grazed)),
        rep("ungrazed", length(samples_ungrazed))
      )
    )
  )
df_varechem

# A tibble: 24 × 15
       N     P     K    Ca    Mg     S    Al    Fe    Mn    Zn    Mo Baresoil Humdepth    pH grazing 
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl> <fct>   
 1  19.8  42.1  140.  519.  90    32.3  39    40.9  58.1   4.5   0.3     43.9      2.2   2.7 ungrazed
 2  13.4  39.1  167.  357.  70.7  35.2  88.1  39    52.4   5.4   0.3     23.6      2.2   2.8 grazed  
 3  20.2  67.7  207.  973. 209.   58.1 138    35.4  32.1  16.8   0.8     21.2      2     3   grazed  
 4  20.6  60.8  234.  834  127.   40.7  15.4   4.4 132    10.7   0.2     18.7      2.9   2.8 grazed  
 5  23.8  54.5  181.  777  126.   39.5  24.2   3    50.1   6.6   0.3     46        3     2.7 grazed  
 6  22.8  40.9  171.  692. 151.   40.8 105.   17.6  43.6   9.1   0.4     40.5      3.8   2.7 ungrazed
 7  26.6  36.7  171.  739.  94.9  33.8  20.7   2.5  77.6   7.4   0.3     23        2.8   2.8 grazed  
 8  24.2  31    138.  395.  45.3  27.1  74.2   9.8  24.4   5.2   0.3     29.8      2     2.8 grazed  
 9  29.8  73.5  260   749. 105.   42.5  17.9   2.4 107.    9.3   0.3     17.6      3     2.8 grazed  
10  28.1  40.5  314.  541. 119.   60.2 330.  110.   61.7   9.1   0.5     29.9      2.2   2.8 ungrazed
# ℹ 14 more rows

Diversity

Ecological diversity is a fundamental concept in community ecology. It’s not just about counting species – it’s about understanding how species are distributed within a community. The vegan package provides several ways to measure diversity, each capturing different aspects of community structure.

Species richness (S) is the simplest measure – it’s just the count of species in a community.

\text S = n_species

But this doesn’t tell us anything about how relatively abundant each species is, also known as species evenness. For that we look to various diversity indices.

The Shannon Index measures the uncertainty in predicting the species identity of a randomly chosen individual from a dataset. It accounts for both richness and evenness.

\text H = -\sum (p_i \cdot \ln(p_i))

where p_i is the proportion of individuals belonging to the i-th species.

Another common diversity index is the Simpson diversity index (D), which measures the probability that two individuals randomly selected from a sample will belong to the same species. It emphasizes dominance—more weight is given to common species. The Simpson index is defined as:

\text D = 1 - \sum (p_i^2)

as before, where p_i is the proportion of individuals belonging to the i-th species.

Finally, Pielou’s evenness (J) specifically measures how evenly individuals are distributed across species. It is derived from the Shannon Index and normalized by the maximum possible diversity (if all species were equally abundant):

\text J = \frac{\text H}{\log(\text S)}

It ranges from 0 to 1, where 0 indicates a community with only one species and 1 indicates a community with equal abundances of all species.

In vegan, we can trivially calculate these diversity indices using the specnumber and diversity functions:

df_varespec_diversity <- tibble(
  S = vegan::specnumber(df_varespec), # Species richness
  H = vegan::diversity(df_varespec), # Shannon diversity index
  D = vegan::diversity(df_varespec, index = "simpson"), # Simpson diversity index
  J = H / log(S) # Pielou evenness
)
df_varespec_diversity

# A tibble: 24 × 4
       S     H     D     J
   <int> <dbl> <dbl> <dbl>
 1    29  2.02 0.822 0.599
 2    26  1.84 0.763 0.564
 3    23  1.83 0.781 0.585
 4    25  1.87 0.744 0.581
 5    26  2.15 0.841 0.659
 6    28  1.98 0.818 0.594
 7    27  2.01 0.803 0.611
 8    27  2.04 0.825 0.619
 9    27  1.24 0.560 0.376
10    31  1.97 0.818 0.573
# ℹ 14 more rows

Let’s see if grazing has a significant effect on species richness and Shannon diversity. We’ll use a sample one-way ANOVA to test this.

df_varespec_diversity %>%
  bind_cols(df_varechem %>% select(grazing)) %>%
  aov(S ~ grazing, data = .) %>%
  summary()

            Df Sum Sq Mean Sq F value Pr(>F)
grazing      1  15.41  15.414   1.744    0.2
Residuals   22 194.42   8.837

df_varespec_diversity %>%
  bind_cols(df_varechem %>% select(grazing)) %>%
  aov(H ~ grazing, data = .) %>%
  summary()

            Df Sum Sq Mean Sq F value  Pr(>F)   
grazing      1 0.8838  0.8838   8.991 0.00662 **
Residuals   22 2.1625  0.0983                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interesting! So while grazing has not affected the overall counts for species (p = 0.2), there is an indication that diversity decreases as a function of grazing (p = 0.0066).

Let’s plot these outcomes as well!

df_varespec_diversity %>%
  bind_cols(df_varechem %>% select(grazing)) %>%
  pivot_longer(cols = H:J, names_to = "Index", values_to = "Diversity") %>%
  ggplot(aes(x = grazing, y = Diversity)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.5) +
  labs(x = "Grazing", y = "Diversity") +
  facet_wrap(
    ~Index,
    scales = "free",
    labeller = labeller(Index = c(
      S = "Species Richness",
      H = "Shannon",
      D = "Simpson",
      J = "Pielou's Evenness"
    ))
  ) +
  ggthemes::theme_base()

Which appear to recapitulate our statistical tests and suggest that we may be looking at a case of overgrazing pressure which might lead to a few species becoming more dominant while others become rarer, reducing evenness without necessarily changing the total number of species present.

Renyi Diversity

Renyi diversity provides a unified framework for understanding many common diversity indices. Rather than calculating a single value, Renyi diversity calculates diversity across a range of scales, giving us a more complete picture of community structure.

The Renyi diversity index is parameterized by a scale parameter alpha, and is defined as:

H_{\alpha} = \frac{1}{1-\alpha} \log \left( \sum_{i=1}^{S} p_i^\alpha \right)

Where p_i represents the proportional abundance of species i. The value of alpha determines how the index treats rare versus common species:

When alpha = 0, the index equals the logarithm of species richness (weights all species equally)
As alpha approaches 1, the index approaches the Shannon diversity index (weights species by their frequency, but still gives rare species significant weight)
When alpha = 2, the index equals the logarithm of the reciprocal Simpson index (gives more weight to common species)
As alpha approaches infinity, the index is determined entirely by the most abundant species

This range of alpha values allows us to see how diversity changes as we give more or less weight to rare species. It’s particularly useful for comparing communities that might have the same value for one diversity index but different values for others.

The Renyi diversity index is implemented in vegan through the renyi function, which returns a specialized data structure that can be directly visualized:

varspec_renyi <- vegan::renyi(df_varespec)
varspec_renyi

         0     0.25      0.5        1        2        4         8        16        32        64       Inf
1 3.367296 2.869202 2.470812 2.017763 1.724352 1.580836 1.5068766 1.4647490 1.4409678 1.4285651 1.4121872
2 3.258097 2.751645 2.336499 1.837101 1.438664 1.145838 0.9906146 0.9246103 0.8947842 0.8805813 0.8668222
3 3.135494 2.648179 2.267373 1.834652 1.518735 1.316793 1.1924340 1.1212539 1.0853630 1.0681356 1.0514460
4 3.218876 2.726105 2.353940 1.871294 1.363111 1.025080 0.8817439 0.8229682 0.7964208 0.7837792 0.7715327
5 3.258097 2.823251 2.510723 2.146992 1.839353 1.632819 1.5022427 1.4247306 1.3812816 1.3594084 1.3381677
6 3.332205 2.802796 2.400430 1.980967 1.704811 1.521643 1.4151268 1.3648125 1.3389827 1.3231019 1.3031047

Looking at the output, we can see that Renyi diversity is calculated for each sample across various alpha values (from 0 to infinity). The renyi function returns an object with class "renyi", which has its own plotting method that produces an informative visualization:

plot(varspec_renyi)

The resulting plot shows Renyi diversity profiles for each site. The x-axis represents the scale parameter alpha, while the y-axis shows the corresponding diversity value. These profiles help us understand how diversity varies across different scales. The confidence intervals and median lines are reflective of all samples and help to identify which particular sites are driving significant deviations from the overall pattern.

In our dataset, we can see that certain sites such as sample 19 and 20 feature consistent lower diversity scores across all alpha values. Recall, these were impacted by grazing.

Rarefaction

A common challenge in ecological studies is comparing diversity across samples with different sampling efforts. Sampling effort refers to the amount of resources (time, number of samples, area sampled, etc.) used to collect the data. Samples with more individuals or greater sampling effort will naturally tend to contain more species, making direct comparisons misleading. Rarefaction addresses this issue by asking: if we had collected fewer samples, how many species would we expect to find? This will downweight the influence of samples collected with greater sampling effort.

The mathematical formula for rarefaction calculates the expected number of species in a random subsample of size n drawn from a larger sample of size N:

E(S_n) = \sum_{i=1}^{S} \left[ 1 - \frac{\binom{N-N_i}{n}}{\binom{N}{n}} \right]

Where:

E(S_n) is the expected number of species in a sample of size n
S is the total number of species in the complete sample
N_i is the number of individuals of species i in the complete sample
N is the total number of individuals in the complete sample

This formula essentially calculates the probability of each species being included in the smaller subsample and sums these probabilities to estimate the expected number of species.

Basic Rarefaction

Rarefaction in vegan requires count data (integers representing the number of individuals), so we’ll switch to the Barro Colorado Island (BCI) dataset, which contains tree counts from 50 one-hectare plots.

data(BCI)
df_bci <- BCI %>%
  as_tibble()
str(df_bci)

tibble [50 × 225] (S3: tbl_df/tbl/data.frame)
 $ Abarema.macradenia              : int [1:50] 0 0 0 0 0 0 0 0 0 1 ...
 $ Vachellia.melanoceras           : int [1:50] 0 0 0 0 0 0 0 0 0 0 ...
 $ Acalypha.diversifolia           : int [1:50] 0 0 0 0 0 0 0 0 0 0 ...
 $ Acalypha.macrostachya           : int [1:50] 0 0 0 0 0 0 0 0 0 0 ...
 $ Adelia.triloba                  : int [1:50] 0 0 0 3 1 0 0 0 5 0 ...
 $ Aegiphila.panamensis            : int [1:50] 0 0 0 0 1 0 1 0 0 1 ...
 $ Alchornea.costaricensis         : int [1:50] 2 1 2 18 3 2 0 2 2 2 ...

There are numerous approaches to selecting the correct baseline for rarefaction. In our case, and that which is most commonly used, is to rarefy to the smallest sample size that is not likely an outlier. Let’s first get the quantiles for the sample sizes (row sums of the species table).

rs <- rowSums(df_bci)
quantile(rs)

   0%   25%   50%   75%  100% 
340.0 409.0 428.0 443.5 601.0

These are all substantial numbers of counts so there is no concern about trimming off bad samples. Therefore, we can proceed to use the smallest sample size as our baseline for rarefaction.

Let’s rarefy to this sample size.

bci_rarefied <- vegan::rarefy(
  df_bci,
  sample = min(rs) # rarefy to the smallest sample size
)
bci_rarefied[1:10]

 [1] 84.33992 76.53165 79.11504 82.46571 86.90901 78.50953 76.34768 81.88136 83.26880 81.97148

The resulting values represent the rarefied species richness for each plot. These values corrected for the sampling effort can now be used in subsequent analyses which we will cover in future tutorials.

All that said, rarefaction, like normalization, does remove information from the data. Rightfully, then, it can come under critique depending on the sub-discipline of interest. Therefore, as with anything in life, it is often worthwhile to plan (i.e. design your study carefully) before taking action (collecting data) to avoid this issue in the first place.

Rarefaction Curves

Speaking of which, one excellent approach to excellent ecological study design, assuming you have a small prototype/pilot dataset, is to use rarefaction curves. Think of these as power analyses for showcasing the sampling effort that would be required per sample to detect a certain number of species.

The rarecurve function in vegan makes it easy to generate these curves:

vegan::rarecurve(
  df_bci,
  label = TRUE,
  sample = min(rs)
)

In the resulting plot, each curve represents a different plot from the BCI dataset. The x-axis shows the sample size (number of individuals), while the y-axis shows the expected number of species. The vertical line indicates the minimum sample size across all plots.

Looking at these curves, we can observe that:

All curves show the characteristic shape of initially rapid species accumulation followed by a gradual flattening
Most curves have not reached a clear asymptote, suggesting that additional sampling would reveal more species

Dissimilarity

While diversity indices tell us about the richness and evenness within individual communities, dissimilarity measures quantify how different communities are from each other. These measures are fundamental to many multivariate analyses in ecology, including ordination, cluster analysis, and hypothesis testing.

Dissimilarity (also known as “distance” in some other scientific fields) measures range from 0 (identical communities) to some maximum value, which depends on the measure used. Typically it will be either 1 or infinity.

Euclidean Distance

Euclidean distance is the straight-line distance in multidimensional space:

d_{ij} = \sqrt{\sum_{k} (x_{ik} - x_{jk})^2}

While commonly used in many fields, Euclidean distance often performs poorly with ecological data unless appropriately transformed, as it’s sensitive to double-zeros and differences in sampling effort. It can extend to infinity to indicate complete dissimilarity.

Bray-Curtis Dissimilarity

Bray-Curtis dissimilarity quantifies the compositional dissimilarity between two communities based on counts at each site:

BC_{ij} = 1 - \frac{2 \sum_{k} \min(x_{ik}, x_{jk})}{\sum_{k} (x_{ik} + x_{jk})}

Where x_ik and x_jk are the counts of species k in samples i and j. This measure is sensitive to both species presence/absence and abundances, making it popular for community composition studies. A distance of 1 indicates that the two samples are completely different.

Jaccard Dissimilarity

Jaccard dissimilarity focuses on species presence/absence, ignoring abundances:

J_{ij} = 1 - \frac{|A_i \cap A_j|}{|A_i \cup A_j|}

Where A_i and A_j are the sets of species in samples i and j. This measure is useful when species identities are more important than their abundances. As with Bray-Curtis, a distance of 1 indicates that the two samples are completely different.

Basic Dissimilarity Calculation in `vegan`

The vegdist function in vegan makes it easy to calculate dissimilarity matrices using various measures. Let’s calculate Bray-Curtis dissimilarity for our vegetation dataset:

dist_varespec_bray <- vegdist(df_varespec, method = "bray")
class(dist_varespec_bray)

[1] "dist"

The result is a distance matrix object, which is a specialized data structure used in R for storing pairwise distances between samples. This object can be directly used in many multivariate analyses, such as cluster analysis and ordination.

Transformations and Standardization

Before calculating dissimilarities, it’s often necessary to transform or standardize the data to address issues like differences in sampling effort, skewed distributions, or the influence of abundant species. The decostand function in vegan offers various transformation methods.

One particularly useful transformation for community data is the Hellinger transformation, which divides the species abundances by the sample total, then takes the square root. This transformation gives less weight to abundant species and addresses the “double-zero” problem (where two samples share the absence of many species, which shouldn’t necessarily make them similar).

Here’s an example of applying the Hellinger transformation before calculating Euclidean distances:

df_varespec %>%
  vegan::decostand(method = "hellinger") %>%
  vegdist(method = "euclidean") %>%
  as.matrix() %>%
  `[`(1:5, 1:5) # Special notation for subsetting a matrix

          1         2         3         4         5
1 0.0000000 0.6885085 0.8832262 0.7910534 0.5081880
2 0.6885085 0.0000000 0.6510691 0.6375933 0.5086377
3 0.8832262 0.6510691 0.0000000 0.7940364 0.7408735
4 0.7910534 0.6375933 0.7940364 0.0000000 0.6271162
5 0.5081880 0.5086377 0.7408735 0.6271162 0.0000000

This approach—Hellinger transformation followed by Euclidean distance— is a popular choice for ordination methods like Principal Component Analysis (PCA) when working with community data. PCA will be covered in the subsequent tutorial on unconstrained ordination methods.

Plotting Dissimilarities

Dissimilarity matrices can be visualized using various techniques, with hierarchical clustering (dendrograms) being one of the most common. Dendrograms group samples based on their dissimilarity, with closely related samples clustered together.

To create a dendrogram, we first calculate the dissimilarity matrix, then perform hierarchical clustering using a method like Ward’s method, which minimizes the variance within clusters:

df_varespec %>%
  vegan::decostand(method = "hellinger") %>%
  vegdist(method = "euclidean") %>%
  hclust(method = "ward.D2") %>%
  plot(ylab = "Euclidean Distance", xlab = "Samples")

In this dendrogram, each leaf represents a sample, and the height of the connecting branches indicates the dissimilarity between samples or clusters. Samples that are more similar to each other are connected at lower heights, while more dissimilar samples or clusters join at greater heights.

The clustering pattern in the dendrogram can reveal natural groupings in the data and relationships between samples. For instance, we might see samples from similar habitats or with similar environmental conditions clustered together, suggesting ecological similarity.

Conclusion

In this article, we’ve explored three fundamental concepts in community ecology: diversity indices, rarefaction, and dissimilarity measures. The vegan package provides powerful tools for analyzing these aspects of ecological communities.

Through diversity indices, we can quantify different facets of community structure, from simple species richness to more complex measures that incorporate species abundances. Rarefaction allows us to compare communities with different sampling efforts by estimating diversity at a standardized sample size. Dissimilarity measures quantify differences between communities, forming the basis for multivariate analyses.

In the next article, we’ll build on these foundations to explore ordination techniques and hypothesis testing using permutation tests, which allow us to visualize community patterns and test ecological hypotheses about community structure and environmental relationships.

Prev Differential Gene Expression Next Unconstrained Ordination Techniques

R Manual Ecology &
Evolutionary Biology

Ecological Diversity, Rarefaction and Dissimilarity

EEB Analyses

Introduction

Diversity

Renyi Diversity

Rarefaction

Basic Rarefaction

Rarefaction Curves

Dissimilarity

Euclidean Distance

Bray-Curtis Dissimilarity

Jaccard Dissimilarity

Basic Dissimilarity Calculation in `vegan`

Transformations and Standardization

Plotting Dissimilarities

Conclusion

EEB Analyses

Introduction

Diversity

Renyi Diversity

Rarefaction

Basic Rarefaction

Rarefaction Curves

Dissimilarity

Euclidean Distance

Bray-Curtis Dissimilarity

Jaccard Dissimilarity

Basic Dissimilarity Calculation in vegan

Transformations and Standardization

Plotting Dissimilarities

Conclusion

Basic Dissimilarity Calculation in `vegan`