Sampling & Sampling Distributions
Generally, in biology we want to learn something about the world as we define it (e.g. all pine trees in North America, Anoles in the Caribbean, Drosophila in Africa), but we usually can't measure every individual in the populations we care about in order to calculate the parameters of interest (e.g. the mean weight of all the Anoles). When we can't do that, we need to take a sample of the population, learn something about the sample, and use what we have learned to generalize outward to the population. In fact, if we could measure the entirety of a population of interest, we wouldn't need most types of statistical models (e.g. regression)! We could just calculate our parameters of interest and be on our way. Therein lies the problem: we usually want to make some statement about the whole population, but we don't usually have data on the whole population. So, we take a sample and try to figure out how best to extrapolate what we learned from that sample back to the rest of the population.
Ideally, when we take a sample of a population it is unbiased; that is, a measurement we take from the sample, such as a mean value, is a good estimate of the population-level value. If we don't collect our sample properly, it's possible that we systematically over- or underestimate the population parameter of interest, making our estimate biased. It's useful to think of some examples of when our sampling protocol might yield biased samples. Say we want to know the mean size of Anoles in the Caribbean. If our sampling method is to do ground observations of Anoles that we can see on a gridded sampling site, we are more likely to see, and therefore sample, Anoles that are more brightly coloured and ones that are lower down in trees or on the ground. Is this a problem for our subsequent statistical calculations? Only if our sample is biased and we a) don't do anything to change our sampling protocol, and/or b) don't account for the bias in our statistical models/tests.
Why is this important from a statistical perspective? The methods we present in this section usually rely on a random sample of the population. In reality, a truly random sample is sometimes impossible, and we work with data that have known limitations. An interesting example of this in an observational context is camera trap data. There are known biases in camera trap data (the interested reader can refer to this paper on the subject), but wildlife researchers who have no choice but to use these data have come up with clever methods to account for these biases statistically. These techniques are outside the scope of this manual, but are interesting nonetheless.
Since dealing with biased samples is difficult, we will constrain our discussion of sampling to random samples, the premise on which a great many common statistical tests are based.
Statistically, we will think of sampling as referring to the drawing of a subset of data points from some larger set (population). How we sample can be the subject of an entire course, but it is helpful to know enough to perform sampling in R, and then to compile those samples together into a sampling distribution.
Common Distributions
Before we go any further, let's remind ourselves of the shapes and names of some of the common distributions that we'll want to have fresh in our minds for this section. As another reminder, we need to be conscious of whether we're dealing with a continuous distribution or a discrete distribution. Here's a visual to remind us of this:
Random Sampling
When we want to sample some set of numbers, we usually have to define what kind of distribution we want to draw from. When we ask R to randomly sample a number for us, we need to at least provide it with some guidelines about what kind of number we want. Do we want only integers? Decimals? If so, how many significant digits? What are the limits?
The way we typically think about sampling is similar to the classic example of a set of balls in a bag, each labeled with a unique number from 1-10. Assuming all else about the balls is equal, we would guess that there is an equal probability of drawing any given ball from the bag. While this feels intuitive, it is actually thinking in distributions! We have just stated that we believe the distribution we are sampling from is uniform, that is, that all possible values have the same probability of being chosen.
Contrast that with something like a Normal (Gaussian) distribution, where we are more likely to select a value that lands closer to the center of the distribution, at the mean.
Random Sampling in R
It's actually quite easy to perform random sampling in R. Given that it's a statistical programming language, the basic installation of R contains the stats package, which includes random sampling functions for a variety of distributions. Let's use an example with the normal distribution. Since the normal distribution takes two parameters, 1) a mean, and 2) a standard deviation, we need to provide those two parameters to the function, along with the number of samples we'd like to draw.
stats::rnorm(n = 10, mean = 5, sd = 1.5)
## [1] 2.012703 4.735336 7.774550 1.796493 2.102768 4.735893 5.427244 4.673579 4.225146 3.122918
And we see R has selected our values for us. Often we want to be able to plot the distribution of values that we've sampled. That's most easily done as a simple histogram, using the hist() function from the graphics package:
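The exact plotting chunk isn't shown above, but a minimal sketch looks like this (here we draw 1,000 values rather than 10 so the shape of the histogram is easier to see):

# draw a larger set of normally distributed values and store them
norm_samples <- stats::rnorm(n = 1000, mean = 5, sd = 1.5)
# plot a simple histogram of the sampled values
hist(norm_samples)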
And we see we’ve approximated a normal distribution here.
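For comparison, the balls-in-a-bag example from earlier corresponds to a discrete uniform distribution, which the base sample() function (used more extensively below) handles directly:

# ten balls in a bag, all equally likely: draw three without replacement
sample(1:10, size = 3)
# draw 1000 times with replacement; each ball comes up roughly 10% of the time
table(sample(1:10, size = 1000, replace = TRUE))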
Sampling Distributions
A sampling distribution refers to the probability distribution of a particular statistic (for example, a mean) that we would obtain across repeated random samples.
Discrete Sampling Distributions
Let us think back to our example in Probability 101 of a discrete sampling distribution – gene frequency. If we were able to "zoom in" on some random part of our own DNA, we would see something like this:
We might imagine that the frequency of each base is equivalent. However, research has shown (e.g. Louie et al. (2003)) that nucleotide frequency is not uniform across human genes. So for a given gene, we may see a non-uniform probability distribution, perhaps one that looks like this:
library(ggplot2)

df <- data.frame(
  base = c("A", "C", "G", "T"),
  prob = c(0.1, 0.2, 0.45, 0.25)
)

ggplot(data = df) +
  geom_col(aes(x = base, y = prob, fill = base)) +
  ggthemes::theme_base() +
  labs(x = "Nucleotide Base", y = "Probability") +
  ylim(c(0, 1))
As a fun little game, let's build a gene. There are approximately 1,000 nucleotide pairs of coding sequence per gene (according to Google), and let's imagine that the gene we're building has the probability distribution of nucleotides given above, where the probabilities of each base appearing in a random draw are 0.1, 0.2, 0.45, and 0.25 for A, C, G, and T respectively.
Using R's functionality, let's sample a variable number of times from our weighted set and see if the distribution we get from our sample matches the theoretical probabilities. We'll sample 20, 100, 1,000, and 10,000 times. The Law of Large Numbers indicates that the more samples we take, the closer we'll get to our theoretical probabilities. Let's try!
To do this, we'll use R's sample() function from the base package, and alter only the size argument, which says how many times to sample.
set.seed(1234)

sample_20 <- sample(c("A", "C", "G", "T"),
                    size = 20,
                    replace = TRUE,
                    prob = c(0.1, 0.2, 0.45, 0.25))

sample_100 <- sample(c("A", "C", "G", "T"),
                     size = 100,
                     replace = TRUE,
                     prob = c(0.1, 0.2, 0.45, 0.25))

sample_1000 <- sample(c("A", "C", "G", "T"),
                      size = 1000,
                      replace = TRUE,
                      prob = c(0.1, 0.2, 0.45, 0.25))

sample_10000 <- sample(c("A", "C", "G", "T"),
                       size = 10000,
                       replace = TRUE,
                       prob = c(0.1, 0.2, 0.45, 0.25))
Now let's take these vectors and turn them into data frames so we can plot them. We can use the convenient table() function to get counts of how many of each base there are:
library(dplyr)
library(ggplot2)
library(ggthemes)
df_20 <- data.frame(
  # use table() to count each base
  table(sample_20)
) %>%
  # rename the columns to be consistent
  dplyr::rename(
    base = sample_20,
    prob = Freq
  ) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    # add a column to tell us how many samples were drawn
    sample = "20",
    # turn the value from a count into a probability
    prob = prob/20
  )

df_100 <- data.frame(
  # use table() to count each base
  table(sample_100)
) %>%
  # rename the columns to be consistent
  dplyr::rename(
    base = sample_100,
    prob = Freq
  ) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    # add a column to tell us how many samples were drawn
    sample = "100",
    # turn the value from a count into a probability
    prob = prob/100
  )

df_1000 <- data.frame(
  # use table() to count each base
  table(sample_1000)
) %>%
  # rename the columns to be consistent
  dplyr::rename(
    base = sample_1000,
    prob = Freq
  ) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    # add a column to tell us how many samples were drawn
    sample = "1000",
    # turn the value from a count into a probability
    prob = prob/1000
  )

df_10000 <- data.frame(
  # use table() to count each base
  table(sample_10000)
) %>%
  # rename the columns to be consistent
  dplyr::rename(
    base = sample_10000,
    prob = Freq
  ) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(
    # add a column to tell us how many samples were drawn
    sample = "10000",
    # turn the value from a count into a probability
    prob = prob/10000
  )
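As an aside, writing the same block four times is error-prone; one could wrap the logic in a small helper function instead. Here is a minimal sketch, where summarise_sample() is a hypothetical helper (not from any package) and df_all_samples_alt is an alternative name so it doesn't clash with the object we build next. We'll carry on with the four explicit data frames below, since the repetition makes each step easy to follow.

# hypothetical helper: turn a vector of sampled bases into a summary data frame
summarise_sample <- function(sampled_bases, n_draws) {
  counts <- table(sampled_bases)
  data.frame(
    base   = names(counts),
    prob   = as.numeric(counts) / n_draws,  # convert counts to proportions
    sample = as.character(n_draws)          # label by the number of draws
  )
}

# build all four summaries in one pass
df_all_samples_alt <- dplyr::bind_rows(
  lapply(c(20, 100, 1000, 10000), function(n_draws) {
    summarise_sample(
      sample(c("A", "C", "G", "T"), size = n_draws,
             replace = TRUE, prob = c(0.1, 0.2, 0.45, 0.25)),
      n_draws
    )
  })
)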
Now we'll combine all these data frames into one, add a reference data frame with the true probabilities, and plot the result as a grouped bar chart:
# join the samples up
df_all_samples <- rbind(
  df_20,
  df_100,
  df_1000,
  df_10000
) %>%
  # add a column to denote these are sampled, not theoretical
  dplyr::mutate(
    type = "sampled"
  )

df_real <- data.frame(
  base = c("A", "C", "G", "T"),
  prob = c(0.1, 0.2, 0.45, 0.25),
  sample = "Real values",
  type = "theoretical"
)

# join the data frames again
df_all <- rbind(
  df_all_samples %>%
    dplyr::mutate(
      sample = as.character(sample)
    ),
  df_real
)

# plot the result
ggplot() +
  geom_bar(data = df_all,
           mapping = aes(x = base, y = prob, alpha = type,
                         fill = sample, colour = type),
           position = "dodge", stat = "identity") +
  ggthemes::theme_base() +
  scale_fill_manual(
    "Sample Size",
    values = wesanderson::wes_palette("Rushmore1", n = 5)) +
  scale_alpha_manual("Theoretical or Sampled", values = c(0.5, 1)) +
  scale_colour_manual("Theoretical or Sampled", values = c("white", "black"))
Here we have intuitively demonstrated the Law of Large Numbers!
Continuous Sampling Distributions
Consider records for aquatic vertebrates (cutthroat trout and salamanders) in Mack Creek, Andrews Experimental Forest, Oregon (1987 – present). This dataset is available in the lterdatasampler package.
library(lterdatasampler)
df <- lterdatasampler::and_vertebrates
head(df)
## # A tibble: 6 × 16
## year sitecode section reach pass unitnum unittype vert_index pitnumber species length_1_mm length_2_mm weight_g clip sampledate notes
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <date> <chr>
## 1 1987 MACKCC-L CC L 1 1 R 1 NA Cutthroat trout 58 NA 1.75 NONE 1987-10-07 <NA>
## 2 1987 MACKCC-L CC L 1 1 R 2 NA Cutthroat trout 61 NA 1.95 NONE 1987-10-07 <NA>
## 3 1987 MACKCC-L CC L 1 1 R 3 NA Cutthroat trout 89 NA 5.6 NONE 1987-10-07 <NA>
## 4 1987 MACKCC-L CC L 1 1 R 4 NA Cutthroat trout 58 NA 2.15 NONE 1987-10-07 <NA>
## 5 1987 MACKCC-L CC L 1 1 R 5 NA Cutthroat trout 93 NA 6.9 NONE 1987-10-07 <NA>
## 6 1987 MACKCC-L CC L 1 1 R 6 NA Cutthroat trout 86 NA 5.9 NONE 1987-10-07 <NA>
Let us look at the distribution of cutthroat trout sizes in our database:
hist(df$length_1_mm)
We can see that this distribution isn't quite normal, but it is continuous and right-skewed. It could be approximated by a log-normal distribution:
x <- seq(0, 250)
y <- dlnorm(x, meanlog = 4.1, sdlog = 0.4)

# scale the density so it overlays nicely on the count histogram
scale_factor <- 5000 / 0.01780701
y <- y * scale_factor

dist_df <- data.frame(x, y)

ggplot() +
  geom_histogram(df, mapping = aes(x = length_1_mm),
                 colour = "grey20", fill = "lightblue") +
  geom_line(dist_df, mapping = aes(x = x, y = y),
            colour = "red", linetype = "dashed", size = 2) +
  ggthemes::theme_base() +
  labs(x = "Cutthroat Trout Length (mm)", y = "Density")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17 rows containing non-finite values (`stat_bin()`).
What will happen if we draw a number of random samples from this distribution and calculate the sample mean? What will the resulting distribution of sample means look like?
What we are doing here is constructing a sampling distribution. We can do this by randomly drawing samples from our population (in this example, every fish in our data set is the “population”), then calculating the mean for each random sample.
As we discuss in the programming concepts section on pseudocoding, let’s think about how to accomplish this before we start blindly writing code. There are steps to our process here:
- 1. Determine how many random samples to draw (let's call this n)
  - 1.1 Decide how many observations go in each sample (let's call this x)
  - 1.2 Pre-allocate an object to hold the n sample means
- 2. Draw a sample
  - 2.1 Pull out x observations
  - 2.2 Calculate the mean
  - 2.3 Store that mean in the corresponding location of the pre-allocated object
- 3. Repeat step 2 n times
The easiest way to accomplish this is with a for-loop. We'll work through the steps in order:
# first, we'll draw 100 samples
n <- 100
# let's pull 10 observations in each sample
x <- 10
# preallocate object to hold sample means
sample_means <- vector(mode = "numeric", length = n)
So how would we perform step 2? We'd sample 10 values from df$length_1_mm, calculate their mean, and assign that mean to the corresponding element of sample_means. Then we simply repeat this n times. Here's that code:
# for each iteration: draw x values, take their mean, and store it
for(i in 1:n) {
  sample_means[i] <- mean(sample(df$length_1_mm, size = x, replace = FALSE))
}
head(sample_means)
## [1] 80.5 79.2 63.6 82.3 73.6 59.9
Okay, so this worked! Excellent. Now, let’s plot a histogram of our results:
hist(sample_means)
Hmm!! A confusing distribution. This doesn’t really resemble any interesting or meaningful distribution. Let’s try the process again, but with a much larger value of n, and see if that changes the result:
n <- 100000
x <- 10
sample_means <- vector(mode = "numeric", length = n)
for(i in 1:n) {
sample_means[i] <- mean(sample(df$length_1_mm, size = x, replace = FALSE))
}
hist(sample_means)
Wow! This distribution is remarkably normal! What's going on here?
We have just stumbled upon the Central Limit Theorem.
Central Limit Theorem
The Central Limit Theorem is an incredibly important theorem that states the following:
Take a population with some mean μ and standard deviation σ. If we repeatedly draw random samples (with replacement) from that population, then as the size of each sample becomes sufficiently large, the distribution of the sample means will be approximately normal (with mean μ and standard deviation σ/√n), regardless of the shape of the population's distribution.
We have just demonstrated this above with our cutthroat trout example: each sample contained x = 10 fish lengths, and the means of those samples piled up into a roughly normal shape (drawing many more samples, i.e. a larger n, simply gave us a clearer picture of that sampling distribution). This is a critically important theorem, because it provides us a path forward to work with statistical tests that assume normal distributions. As you move further through your statistical training, you will learn that normality makes our lives infinitely easier.
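To see that it is the per-sample size that drives the normality (and the tightness) of the sampling distribution, here is a minimal sketch, assuming the df object from the cutthroat trout example above is still in memory; it compares sample means built from 5 observations against means built from 50:

# drop missing lengths so every sampled value is numeric
lengths <- na.omit(df$length_1_mm)

n <- 10000  # number of repeated samples
means_x5  <- replicate(n, mean(sample(lengths, size = 5)))
means_x50 <- replicate(n, mean(sample(lengths, size = 50)))

# both histograms look roughly normal, but means of the larger samples
# cluster much more tightly around the population mean
par(mfrow = c(1, 2))
hist(means_x5,  main = "x = 5 per sample",  xlab = "Sample mean (mm)")
hist(means_x50, main = "x = 50 per sample", xlab = "Sample mean (mm)")
par(mfrow = c(1, 1))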
If you care to learn more about statistics, I strongly recommend thinking deeply about the central limit theorem and trying to come up with your own intuitive understanding of why it is true. There are many excellent descriptions of proofs of the central limit theorem. For the advanced and interested student, there's an excellent recorded lecture by John Tsitsiklis at MIT that goes into detail on this theorem and why it is central not only to inferential statistics, but also to how statistics links to calculus.
Law of Large Numbers
In the example we gave on discrete sampling distributions, we have already discovered what this "law" purports. Essentially, it states that if you repeat an experiment some number, n, of times and average the result, then as n gets larger, the averaged result will get closer and closer to the expected value.
We saw this in our example, as the run with 10,000 draws came closest to the theoretical probabilities. We could deduce a short proof for this, but that's beyond the scope of this short piece.
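For one more visual of the law in action, here is a minimal sketch (again assuming df from the cutthroat trout example is still loaded) that tracks the running mean of randomly drawn trout lengths as the number of draws grows:

# drop missing lengths, then draw repeatedly with replacement
lengths <- na.omit(df$length_1_mm)

set.seed(42)
draws <- sample(lengths, size = 5000, replace = TRUE)

# running mean after 1, 2, ..., 5000 draws
running_mean <- cumsum(draws) / seq_along(draws)

plot(running_mean, type = "l",
     xlab = "Number of draws (n)", ylab = "Running mean length (mm)")
# the running mean settles onto the population mean as n grows
abline(h = mean(lengths), col = "red", lty = 2)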
An important takeaway is that this law tells us something valuable about what to keep in mind when we deal with small samples: the smaller the sample size at hand, the further we are likely to be from a representative sample of the population. While this may seem intuitive, it's helpful to know that there is a provable way to understand this concept. (We can also think about this as it relates to statistical power.)