Skip to content

Comparisons Between Sample Means

It’s often the case that we may sample independently from a population, and develop some hypothesis that may indicate that the means for the independent samples are in fact different from one another. Taking the example we noted in the hypothesis testing section, perhaps we care about the size of fiddler crabs, and if the crabs are from multiple sites, we could say that, between two sites of interest, we might believe there to be a size difference. Well, let’s perform a statistical test to see if there is support for this hypothesis.

It is generally best in statistics, especially when starting our our own individual journeys as statisticians, to choose the simplest test that A) will answer our question, and B) we can satisfy the assumptions of.

When comparing two sample means, the standard test to perform here is the two-sample t-test. The test statistic is calculated as:

t = \frac{(\bar{Y_1} - \bar{Y_2}) - (\mu_1 - \mu_2)_0}{SE_{Y_1 - Y_2}}

This test assumes three major things:

  1. Normality
  2. Homoskedasticity
  3. Independence

Our question is as it stands is simply that we believe there is a size difference between crabs from different locations. Let’s start by attempting to justify these assumptions, and if we can, using the simple test.

Two-sided T-tests

We’ll use data on crab sizes:

Now what? Well there are other tests available to us!! A reminder that parametric tests like the t-test, often have a non-parametric alternative, that usually relax assumptions of normality, and are good for data with small sample sizes. Here we have only 28 observations in each group, and non-normality so let’s turn to the non-parametric tests.

library(lterdatasampler)
crab_sizes <- lterdatasampler::pie_crab

There are a number of sites in this dataset, so let’s compare crabs from two northern sites and two southern sites. Our northern sites will be “Cape Cod” and “Narragansett Bay NERR” and our southern sites will be “Guana Tolomoto Matanzas NERR” and “Sapelo Island NERR”.

It’s important to never forget the underlying mathematics of what we’re doing, we should never simply rely on our computers exclusively and not understand what we’re doing, so let’s review the equations underlying the t-test. We’ll be calculating the test statistic \(t\) as described above. As we are testing a hypothesis that the sizes are different, we can state this formally as \(H_0: \mu_1 = \mu_2\) and \(H_1: \mu_1 \neq \mu_2\).

Since we stated we believe \(H_0: \mu_1 = \mu_2\), this is a two-sided test, we have not suggested that \(\mu_1 > \mu_2\) or \(\mu_1 < \mu_2\).

Assumptions of the Test

Before we proceed we need to take into account assumptions of our chosen test. We’ll assume our samples are independent (that is, that we’ve selected samples that aren’t related), but our two-sample t-test requires two other things: 1) normality, and 2) homoskedasticity.

Normality

To assess the normality of our data, let’s first assess it visually and then double-check with a statistical test of normality (the Shapiro-Wilks test). A convenient package for this task is the ggpubr package which allows to to plot density plots and QQ plots.

Let’s use a density plot to check if the data follow the normal/Gaussian curve:

library(ggpubr)
library(dplyr)

# first, cut our data to just the sites we want
crab_sizes_south <- crab_sizes %>%
  dplyr::filter(site %in% c("GTM", "SI"))
crab_sizes_north <- crab_sizes %>%
  dplyr::filter(site %in% c("CC", "NB"))

# plot our variable of interest (size)
ggpubr::ggdensity(crab_sizes_north$size, main = "Northern")
ggpubr::ggdensity(crab_sizes_south$size, main = "Southern")

They both look normal. Let’s look at the QQ plots:

ggpubr::ggqqplot(crab_sizes_north$size, main = "North")
## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
ggpubr::ggqqplot(crab_sizes_south$size, main = "South")
## Warning: The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
## The following aesthetics were dropped during statistical transformation: sample
## ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

We’re looking for the points to follow the black line and for no points to be outside the grey error bars. Both of these look pretty good. It’s worth noting that while for our purposes we can simply use plots to check for normality, it can sometimes be helpful to use a statistical test to check for normality. Since we’re using a t-test, this isn’t necessary as they can be robust to some degree of non-normality, but if strict normality is required, it’s possible to use a Shapiro-Wilk’s test. A Shapiro-Wilk’s test is a significance test that has a null hypothesis of normality. If the test is SIGNIFICANT then the test has indicated there is non-normality in the data..

We can do this in R very easily. The function shapiro.test() comes shipped in core R with the stats package. First for the northern sites:

stats::shapiro.test(crab_sizes_north$size)
## 
## 	Shapiro-Wilk normality test
## 
## data:  crab_sizes_north$size
## W = 0.99161, p-value = 0.9644

Okay, and we see that our p-value is > 0.05 so we’ll fail to reject the null. For the southern sites:

stats::shapiro.test(crab_sizes_south$size)
## 
## 	Shapiro-Wilk normality test
## 
## data:  crab_sizes_south$size
## W = 0.96787, p-value = 0.1269

Here we see a large p-value, >> 0.05, so again we’ll reject the null.

Good, we’ve satisfied the condition of normality.

Homoskedasticity

Our second assumption was that the variances of the two samples be equal. If they are not (heteroskedasticity), all is not lost and we can use a different version of the t-test. As with normality, it is usually sufficient to check this assumption with a plot. What we want to see is that the variance (recall that variance is simply a measure of how “spread out” a set of numbers are) is approximately equal for the two groups of interest. The word “homoskedastic” roughly translates to “having the same scatter” from the Greek root words, so we can use a scatter plot, a box plot, or a violin plot to look at this.

The test

So now we can actually perform the test. It’s quite simple, we’ll use the t.test() function:

t_test_res <- stats::t.test(
  # sample of group 1
  crab_sizes_north$size,
  # sample of group 2
  crab_sizes_south$size,
  # Regular t-test or Welch's t-test depending on the variance
  var.equal = TRUE
)
t_test_res
## 
## 	Two Sample t-test
## 
## data:  crab_sizes_north$size and crab_sizes_south$size
## t = 14.436, df = 112, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.089136 6.708352
## sample estimates:
## mean of x mean of y 
##  16.98357  11.08483

We see our result here, and it turns out the p-value is very small, so we can reject the null hypothesis, there is good evidence that $\mu_1 \neq \mu_2$. NOTE that if our data were heteroskedastic, then we could specify var.equal = FALSE and that would perform a Welch’s t-test which allows for this.

If we wanted, we could plot this result as a box plot, or perhaps more usefully, a violin plot (For a full walk through of how to iteratively make this plot, see our section on plotting Values Between Groups):

library(ggplot2)
library(ggthemes)

# first get one dataframe with the four sites we want
focal_sites <- crab_sizes %>%
  dplyr::filter(site %in% c("CC", "NB", "GTM", "SI")) %>%
  # make new column to denote our two groupings
  dplyr::mutate(
    group = as.factor(ifelse(
      # if the site is CC or NB, the value of `group` is "north" if not, "south"
      site %in% c("CC", "NB"), "North", "South"
    ))
  ) %>%
  # keep only the columns we need
  dplyr::select(size, group)

# plot the data 
ggplot(data = focal_sites) +
  geom_violin(aes(x = group, y = size, fill = group)) +
  stat_summary(aes(x = group, y = size),
              fun = "mean",
              geom = "crossbar",
              width = 0.5,
              colour = "black") +
  ggthemes::theme_base() +
  labs(y = "Size (mm)", x = "Geographic Region") +
  theme(
    legend.position = "none"
  ) +
  annotate(
    geom = "text",
    x = 2.1,
    y = 21,
    label = "p < 0.001",
    size = 8
  )

One-sided T-tests

In our above example we stated our hypothesis such that we could use a two-sided test. But what if when we were formulating our hypotheses, we’d said we believe the southern crabs would be bigger? Note that we’re pretending here that we HAVEN’T seen the results just above, since that would be p-hacking. But let’s say we thought “Hey, warmer water, lots of sun, maybe more beaches, that could mean bigger crabs.” (I stress this is not an intelligently derrived biological hypothesis).

But in this case, we could perform the exact test as above, but instead determine that we wanted a one-sided test. That’s easy enough and uses all the same code, with one additional argument. Using the same dataset, we would simply add an argument for alternative. The syntax here is saying that in the function, we pass two sets of numbers, x and y. If we believe x to be less than y, we pass “less”, and if we believe x to be greater than y, then we pass “greater”.

In our case, we are hypothesizing that the southern crabs (crab_sizes_south) are greater in size than the northern crabs, so let’s set x to be the crab_sizes_south and the y to be crab_sizes_north, and then alternative = "greater", so this is testing the alternate hypothesis that \(\mu_{south} > \mu_{north}\):

t_test_oneside_res <- stats::t.test(
  crab_sizes_south$size,
  crab_sizes_north$size,
  var.equal = TRUE,
  alternative = "greater"
)
t_test_oneside_res
## 
## 	Two Sample t-test
## 
## data:  crab_sizes_south$size and crab_sizes_north$size
## t = -14.436, df = 112, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -6.576453       Inf
## sample estimates:
## mean of x mean of y 
##  11.08483  16.98357

So we see here we have a p-value of 1. Given that we do know which group is actually bigger this comes as no surprise. Although this is a good reminder: Just because our one-sided test with an $H_A: \mu_{south} > \mu_{north}$ resulted in us failing to reject the null, this does NOT mean that we can assume that in fact (\mu_{south} > \mu_{north}) because the null hypothesis of these tests is stated that (H_0: \mu_{south} = \mu_{north})

What to do with non-normality?

Going back to our assumptions, let’s re-start our process of this test but with a different grouping of data.

What if we just took the MOST northern and southern sites alone, not groups of sites, and did a t-test on those two? say that instead of the two sites acting as our “northern” sites, we chose to use the single most northern site: “Plum Island Estuary – West Creek” (PIE), and for our southern site, we took the most southern one – “Guana Tolomoto Matanzas NERR” (GTM). If we start again by checking our assumptions, we would first have to check for normality of both our groups.

Let’s start by again dividing our data:

# southern ones stay the same
crab_sizes_s_GTM <- crab_sizes %>%
  dplyr::filter(site == "GTM")

# new sites here
crab_sizes_n_PIE <- crab_sizes %>%
  dplyr::filter(site == "PIE")

# plot our variable of interest (size)
ggpubr::ggdensity(crab_sizes_s_GTM$size, main = "GTM")
ggpubr::ggdensity(crab_sizes_n_PIE$size, main = "PIE")

Okay, so these data look somewhat skewed. Let’s do our Shapiro-Wilk’s test:

stats::shapiro.test(crab_sizes_s_GTM$size)
## 
## 	Shapiro-Wilk normality test
## 
## data:  crab_sizes_s_GTM$size
## W = 0.90078, p-value = 0.01193
stats::shapiro.test(crab_sizes_n_PIE$size)
## 
## 	Shapiro-Wilk normality test
## 
## data:  crab_sizes_n_PIE$size
## W = 0.84894, p-value = 0.0008899

Okay, so our p-values for both are < 0.05, so we have to reject the null – these data are non-normal.

Now what? Well there are other tests available to us!! A reminder that parametric tests like the t-test, often have a non-parametric alternative, that usually relax assumptions of normality, and are good for data with small sample sizes. Here we have only 28 observations in each group, and non-normality so let’s turn to the non-parametric tests.

Non-Parametric Tests

The best alternative for our case is the Mann-Whitney U Test.

There are a few things to note about this test however. This test actually does NOT compare sample means. The Mann-Whitney’s stated use is to determine whether or not the groups or samples are drawn from equal populations, that is, are the shape of the data the same. It can be interpreted that this is analogous to saying the median values are the same for both groups.

Assumptions

ust because the data do not have to be normal, there are still important assumptions we need to satisfy. The data still need to be independent, but there are two additional assumptions:

  1. The variable needs to be continuous
  2. The data are assumed to be similar in shape

The second assumption here is important! While the data do not have to be normal, they have to exhibit a similar skew.

Continuity

We know that our variable is continuous, so this assumption is met.

Skew

Measuring a degree of skew is somewhat arbitrary and can be difficult. It is beyond the scope of this tutorial to do so, and so we will satisfy ourselves with simply stating that the density distributions are both left-skewed, and leave it at that.

The test

The Mann-Whitney is sometimes referred to as the Mann-Whitney-Wilcoxon test, and so is performed using the wilcox.test() function, again from stats:

stats::wilcox.test(crab_sizes_s_GTM$size,
                  crab_sizes_n_PIE$size)
## 
## 	Wilcoxon rank sum exact test
## 
## data:  crab_sizes_s_GTM$size and crab_sizes_n_PIE$size
## W = 26, p-value = 3.068e-12
## alternative hypothesis: true location shift is not equal to 0

Here we see our p-value is << 0.05, so we can reject our null hypothesis that say that these two medians are equal, and thus conclude (generally) that the northern and southern populations have different sizes. Note again that this is a two-sided test so we’re not testing directionality.