
Analysis of Variance

In our discussion of comparing sample means, we saw how to use R to compare two sample means, usually through a t-test of some form. However, it’s not uncommon to start asking questions where we want to compare sample means across a larger number of groups. For example, say we have four groups, 1 through 4.

This brings us to a conundrum: we have to re-formulate our null and alternative hypotheses in a way that is useful for more than two groups. If we want to compare the means of four groups of a single grouping variable, we will use a single-factor ANOVA to do so. Our task here is to figure out how much variation among the group means should be present if the null hypothesis is true.

To be clear, when we are discussing this one-factor ANOVA, we are stating that our null hypothesis is that there is no difference among the group means, so \(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\). Our alternative hypothesis, then, is that at least one mean is different.

F-statistic

Just like when we did t-tests and had the t-statistic, our test statistic for the one-factor ANOVA is the F-statistic. Since our interest here is variance, the F-statistic is best thought of as a ratio of two variances, which we express in terms of mean squares. We first have the group mean square, which we calculate by dividing the group (between-group) sum of squares by its degrees of freedom, \(k - 1\), where \(k\) is the number of groups:

MSG = \frac{\sum_{i=1}^{k}{n_i (\bar{y}_i - \bar{y})^2}}{k - 1}

and then we have the mean square error, which we calculate by dividing the error (within-group) sum of squares by its degrees of freedom, \(n - k\), where \(n\) is the total number of observations:

MSE = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}{(y_{ij} - \bar{y}_i)^2}}{n - k}

These two quantities can be thought of as representing the variance among group means (MSG) and the variance among subjects within the same group (MSE). It follows, then, that the F-statistic is the simple ratio

F = \frac{MSG}{MSE}

If the ratio is equal to 1, the two variances are in fact the same, and there is no additional variance among the sample means (represented by MSG) beyond the error within groups (represented by MSE).

We then use our p-value as usual to determine the probability of obtaining an F-statistic at least this large by chance, and we reject or fail to reject the null hypothesis accordingly.
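To make the ratio concrete, here is a minimal by-hand sketch (the three groups and their values are simulated purely for illustration) that computes MSG, MSE, and F, and checks the p-value with pf():

# simulate three hypothetical groups, the third with a shifted mean
set.seed(1)
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 5), rnorm(10, mean = 7))
groups <- rep(c("g1", "g2", "g3"), each = 10)

k <- 3              # number of groups
n <- length(y)      # total number of observations
group_means <- tapply(y, groups, mean)
group_ns <- tapply(y, groups, length)

msg <- sum(group_ns * (group_means - mean(y))^2) / (k - 1)  # group mean square
mse <- sum((y - group_means[groups])^2) / (n - k)           # mean square error
f_stat <- msg / mse

# probability of an F-statistic at least this large under the null
pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)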

Performing the ANOVA

When performing the ANOVA, as with all statistical tests, there are some steps that need to take place before we actually run the test. We first gather and organize our data as needed. Then we check that our data meet the assumptions of the test we plan to carry out; only then can we perform the test and interpret the results.

For our example here, let’s consider the sizes of penguins’ bills.

library(tidyverse)
library(palmerpenguins)

df <- palmerpenguins::penguins
names(df)
## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"       "sex"              
## [8] "year"

In our example here, we may hypothesize that the mean bill length differs among the species in our dataset. Bill length in a penguin looks vaguely like this:

Artwork by @allison_horst

So first we’ll look at the species on hand:

unique(df$species)
## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

While this is not a strict requirement, it’s probably best to check that each species has roughly the same number of samples per group before we continue. Let’s use the table() function for this:

table(df$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

We can also plot this:

library(ggthemes)

# make grouping variable a factor
df$species <- as.factor(df$species)

ggplot() +
  geom_bar(data = df, mapping = aes(x = species, fill = species),
          colour = "black") +
  ggthemes::theme_base()

Okay, so we can see that while the group sizes are not completely equal, they’re close enough for our purposes, especially with an ANOVA, as it’s relatively insensitive to moderate differences in the number of samples per group.

Assumptions

The assumptions of this one-factor ANOVA are:

  1. Observations are independent and random
  2. Data in each level of the grouping are normally distributed
  3. The populations have common variances

Normality

A note: while plotting our data can be helpful, it can also lead to p-hacking and HARKing (Hypothesizing After Results are Known). We often plot our data beforehand as part of data exploration, but that gives us insight, prior to performing our tests, into what the means of each group might be. Again, this in and of itself is not bad, but it is important to recall that our process here should be to develop our hypothesis and then perform our statistical test. Here we’ll use QQ plots to check for normality in each of our groups:

library(ggpubr)

# first filter the data
adelie <- df %>%
  dplyr::filter(species == "Adelie")
chin <- df %>%
  dplyr::filter(species == "Chinstrap")
gentoo <- df %>%
  dplyr::filter(species == "Gentoo")

# plot qq-plots for each one
ggpubr::ggqqplot(adelie$bill_length_mm, main = "Adelie")
## Warning: Removed 1 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 1 rows containing non-finite values (`stat_qq_line()`).
ggpubr::ggqqplot(chin$bill_length_mm, main = "Chinstrap")
ggpubr::ggqqplot(gentoo$bill_length_mm, main = "Gentoo")
## Warning: Removed 1 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 1 rows containing non-finite values (`stat_qq_line()`).

Visually, all of these look mostly okay, but the Gentoo plot specifically looks as though it might be a bit skewed. That is, some of the points at the right end of the plot come close to, or cross, the shaded bounds. To be conservative and cautious, we could interpret this as non-normality.

Two things to note here. First, ANOVAs are actually relatively robust to violations of the normality assumption. This means that, given the data are not that far off of a normal distribution (which we can see via our QQ plots), we could in theory proceed with our test here with relative confidence. Second, log-transforming our data is a valid step before an ANOVA. This is helpful for us: while we could go ahead with the test (after checking our other assumptions), an easy log transformation is available that saves us from technically violating the assumptions of our test.

When we transform data, all we are really doing is changing each value in our data by the same mathematical function, in this case the natural logarithm, \(\ln\). The logarithmic function is useful for us because it helps deal with skew in our data, or, in other cases, with data that span many orders of magnitude. However, we still need to check that our log-transformed data meet the assumptions of our test before we continue, especially since log-transformation is not guaranteed to ALWAYS fix problems of normality.
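For instance, the natural log turns multiplicative spread into additive spread, pulling in a long right tail:

# values spanning three orders of magnitude become evenly spaced
log(c(1, 10, 100, 1000))
## [1] 0.000000 2.302585 4.605170 6.907755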

An additional note: if our assumptions are not met completely, and perhaps a transformation does not help us, it is possible to use a non-parametric alternative such as the Kruskal-Wallis test, which we do not cover in this section but is similarly easy to run, as sketched below.
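For reference only, a Kruskal-Wallis call uses the same formula syntax as the tests later in this section; a minimal sketch on the raw (untransformed) data:

# a rank-based alternative to the one-factor ANOVA; no normality assumption
stats::kruskal.test(bill_length_mm ~ species, data = df)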

So let’s go ahead and transform our data:

adelie <- adelie %>%
  dplyr::mutate(
    log_length = log(bill_length_mm)
  )
chin <- chin %>%
  dplyr::mutate(
    log_length = log(bill_length_mm)
  )
gentoo <- gentoo %>%
  dplyr::mutate(
    log_length = log(bill_length_mm)
  )

Now we’ll go ahead and re-do our QQ plots:

# plot qq-plots for each one
ggpubr::ggqqplot(adelie$log_length, main = "Adelie")
## Warning: Removed 1 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 1 rows containing non-finite values (`stat_qq_line()`).
ggpubr::ggqqplot(chin$log_length, main = "Chinstrap")
ggpubr::ggqqplot(gentoo$log_length, main = "Gentoo")
## Warning: Removed 1 rows containing non-finite values (`stat_qq()`).
## Warning: Removed 1 rows containing non-finite values (`stat_qq_line()`).

Okay great, we can now see that each of the three looks good, so we can assume normality for these values.

Homoskedasticity

Next, let’s check our variances.

Our third assumption was that the variances of all samples be equal. While it’s not necessary to always do so, some researchers will use a standardized test to check for homoskedasticity. We will test for homoskedasticity with Bartlett’s test. The null hypothesis of Bartlett’s test is that the variances are equal, so if p < 0.05 then for our purposes we could say that the assumption of homoskedasticity is violated. As a side note, if our assumptions of normality were not quite met, it would be preferable to use Levene’s test, which is less sensitive to departures from normality.

To do Bartlett’s test, let’s use the stats package function bartlett.test(). We will first re-format our data so that it’s in long form; this will make our lives easier. To do this, we want one column with the actual values, and another with the grouping variable (in this case, our species names). To get the species names, we’ll combine a few arguments so that we can do it all in one command. We’ll use the rep() function, which takes two arguments: a) the value to be repeated, and b) the number of times to repeat it. Because there are different numbers of observations for each species, the second argument in each call to rep() will simply be the length of the log_length vector for that species. Also, we want the grouping variable to be a factor. That looks like this:

bill_data <- data.frame(
  log_length = c(adelie$log_length, chin$log_length, gentoo$log_length),
  species = as.factor(c(
    # first for the adelie species since it's the first in the vector above
    rep("adelie", length(adelie$log_length)),
    rep("chin", length(chin$log_length)),
    rep("gentoo", length(gentoo$log_length))
  ))
)

# let's look at the structure of the df
str(bill_data)
## 'data.frame':	344 obs. of  2 variables:
##  $ log_length: num  3.67 3.68 3.7 NA 3.6 ...
##  $ species   : Factor w/ 3 levels "adelie","chin",..: 1 1 1 1 1 1 1 1 1 1 ...
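As an aside, the same long-form data frame can be built more directly from df with mutate() and select(); a quick sketch (note the species labels keep their original capitalization here, e.g. "Chinstrap" rather than "chin"):

# equivalent long-form construction without the rep() calls
bill_data_alt <- df %>%
  dplyr::mutate(log_length = log(bill_length_mm)) %>%
  dplyr::select(log_length, species)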

Great, now for our Bartlett’s test. The first argument is a formula: the column that has the values of interest, then the tilde (~), then the name of the grouping column. Last, we tell the function what the dataframe itself is called.

stats::bartlett.test(log_length ~ species, data = bill_data)
## 
## 	Bartlett test of homogeneity of variances
## 
## data:  log_length by species
## Bartlett's K-squared = 0.80327, df = 2, p-value = 0.6692

Wonderful: we see that the p-value is well above 0.05, so our variances are equal enough to continue.
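Had our normality checks been borderline, Levene’s test mentioned above follows the same formula syntax; a sketch, assuming the car package is installed:

# Levene's test is less sensitive to departures from normality than Bartlett's
car::leveneTest(log_length ~ species, data = bill_data)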

Performing the Test

Now to perform the test itself. Similarly to the t-test, it is very easy to compute in R, and the syntax looks pretty much identical to that of Bartlett’s test: the column that has the values of interest, then the tilde (~), then the name of the grouping column. The function we use, aov(), again comes from the stats package. We can now proceed to our test and run it all like this:

mod <- stats::aov(log_length ~ species, data = bill_data)
summary(mod)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## species       2  3.846  1.9230   427.6 <2e-16 ***
## Residuals   339  1.525  0.0045                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness

So we see our p-value is very small, which indicates we can reject the null hypothesis: at least one species’ mean bill length differs from the others.
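If we want to pull these numbers out programmatically (say, for reporting), summary(mod) is a one-element list holding the ANOVA table as a data frame; a small sketch:

# extract the F-statistic and p-value for the species term
aov_table <- summary(mod)[[1]]
aov_table[["F value"]][1]  # the F-statistic
aov_table[["Pr(>F)"]][1]   # its p-value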

Interpretation

There are still a few steps we might want to take. First, we still don’t know what the actual means of the groups are, nor which differ from one another. Also, it’s almost always advantageous to plot our output, so let’s do that now.

Tukey Comparison

To see which groups actually differ from one another, we can examine the pairwise comparisons between the groups. This is useful as it allows us to identify whether only one group is different (i.e. \(\mu_1 = \mu_2\) but \(\mu_3 > \mu_2\)) or whether they’re all quite a bit different. We can do this comparison through the Tukey test (Tukey Honest Significant Differences), which again can be done using the stats package:

stats::TukeyHSD(mod)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: stats::aov(formula = log_length ~ species, data = bill_data)
## 
## $species
##                      diff         lwr          upr    p adj
## chin-adelie    0.23023589  0.20718014  0.253291643 0.000000
## gentoo-adelie  0.20292758  0.18375262  0.222102538 0.000000
## gentoo-chin   -0.02730831 -0.05116498 -0.003451639 0.020178

So here we get four key outputs: diff, the difference between the two group means at hand (the first row is read as: the chinstraps’ mean log bill length is ~0.23 greater than the adelies’); lwr and upr, the lower and upper bounds of the 95% confidence interval; and p adj, the p-value denoting whether said difference is significant.
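One caveat: because we ran the test on log-transformed lengths, diff is a difference of log means, so exponentiating it gives an approximate multiplicative difference on the original millimetre scale:

# back-transform the chin-adelie difference from the log scale
exp(0.23023589)  # ~1.26, i.e. chinstrap bills are roughly 26% longer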

Plotting the Results

Now let’s plot our results. This can be done in any number of ways, including a boxplot, but I am personally keen on plotting the density of the values, in the same way we would if we wanted to check assumptions of normality, as it shows all components of the data at hand and doesn’t risk hiding anything. I also think it’s just a useful plot to have. We’ll pair that with dashed lines marking the mean of each group, as well as annotating the p-value onto the plot, like this:

means <- bill_data %>%
  dplyr::group_by(species) %>%
  dplyr::summarise(means = mean(log_length, na.rm = TRUE))

ggplot(data = bill_data) +
  geom_density(aes(x = log_length, fill = species),
              alpha = 0.3, colour = "black") +
  ggthemes::theme_base() +
  labs(x = "Log Bill Length", y = "Density") +
  scale_fill_manual("Species", values =
                      c("purple", "goldenrod2", "steelblue2")) +
  geom_vline(xintercept = means$means[which(means$species == "adelie")],
          linetype = "dashed", colour = "purple") +
  geom_vline(xintercept = means$means[which(means$species == "chin")],
          linetype = "dashed", colour = "goldenrod2") +
  geom_vline(xintercept = means$means[which(means$species == "gentoo")],
          linetype = "dashed", colour = "steelblue2") +
  annotate(geom = "text",
          x = 3.55,
          y = 5.9,
          label = "p-value << 0.001")
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).