Filtering & Subsetting Data
To filter or subset data is to take a dataframe (or some other object) and select only part of it based on some criteria. Most commonly, we want to make a new dataframe from an existing one by filtering its rows based on the values in one or more columns.
As with all things data, we show how to perform the task using base R functions as well as the Tidyverse.
Filter Based on Single Value
We will use an example dataset here:
# remotes package is needed to get the most recent version of the package
install.packages("remotes")
# install package
remotes::install_github("lter/lterdatasampler")
library(lterdatasampler)
library(tidyverse)
If we want to filter a dataset based on a single value in a column, we can make use of Boolean operators.
# first call the dataset from the lterdatasampler package
crabs <- lterdatasampler::pie_crab
The :: Operator
Why have we written lterdatasampler::pie_crab here? Writing the package name and the :: operator tells R which namespace to look in for the object we’re trying to use in our assignment statement!
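As a small aside (a sketch, not part of the original example), the :: form works even when the package has not been attached with library(), which is one reason to prefer it when code needs to be unambiguous:
# explicit namespace: works whether or not the package is attached
head(lterdatasampler::pie_crab)
# relies on library(lterdatasampler) having been called first
head(pie_crab)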
Let’s look at the columns:
# this will show us the names of the different columns
names(crabs)
## [1] "date" "latitude" "site"
## [4] "size" "air_temp" "air_temp_sd"
## [7] "water_temp" "water_temp_sd" "name"
We see there is a column called site. We can get the unique values of that column:
unique(crabs$site)
## [1] "GTM" "SI" "NIB" "ZI" "RC" "VCR" "DB" "JC" "CT" "NB"
## [11] "CC" "BC" "PIE"
So we might want to filter our data such that we have a new dataframe where site is only equal to “VCR”.
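Before filtering, it can help to see what the comparison itself returns: a logical vector with one TRUE or FALSE per row of the dataframe (a minimal illustration):
# the comparison produces TRUE/FALSE for every row of crabs
head(crabs$site == "VCR")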
Fiddler Crabs
The Atlantic marsh fiddler crab (Minuca pugnax) can be found along the Eastern seaboard of the United States, from Florida up into the Northeast. They are found in salt marshes and have been documented moving further north with climate change and warming habitats. The data on these particular crabs in our example dataset come from the Plum Island Long Term Ecological Research Site in Massachusetts.
base R
Filtering in base R makes use of the square bracket indexing discussed elsewhere. Recall that the index goes [row, column]. In the row position, we use the which() function, which returns the positions at which a logical condition is TRUE. Since we want to keep all the columns in our dataset, we leave the column position empty, indicating that all columns should be retained. In our example that looks like this:
vcr_crabs <- crabs[which(crabs$site == "VCR"),]
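To see what which() is doing inside those brackets, here is a minimal illustration of the value it returns (just the first few row positions):
# which() converts the logical vector into the integer row positions where it is TRUE
head(which(crabs$site == "VCR"))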
We can check it worked by calling unique() on the new dataframe as above:
unique(vcr_crabs$site)
## [1] "VCR"
And we can see it worked as expected.
Tidyverse
Filtering in the Tidyverse mostly uses the dplyr package, which is EXTREMELY useful for most data management exercises. Here we use the aptly named filter() function to perform the operation.
In order to make the best use of the Tidyverse, we need the %>% operator, which we call a “pipe”. It actually comes from a Tidyverse package called magrittr and is used to pass, or “pipe”, an object forward to a function or expression call. Think of it in terms of a “workflow” with “pipelines” of analysis and data manipulation. Why is an operator like %>% all that beneficial? It seems annoying at first, but it actually changes your code semantically in a way that makes it more intuitive to both read and write. If you’re interested, you can find out more about the pipe and other magrittr functions in the package docs.
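As a quick illustration of how the pipe changes the way code reads (a sketch, not part of the original example), compare a nested call with its piped equivalent:
# nested: you read it from the inside out
head(dplyr::filter(crabs, site == "VCR"))
# piped: you read it left to right, top to bottom
crabs %>%
  dplyr::filter(site == "VCR") %>%
  head()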
vc_crabs <- crabs %>%
dplyr::filter(site == "VCR")
The %>% pipe operator is being used to pass the original dataframe object crabs to our function, filter(), and the logical comparison site == "VCR" is the only argument needed for this function.
Filter Based on the %in% Operator
There are many permutations of this single value option with Boolean operators. For example, using either approach, we could filter by asking R to select rows where site == "VCR" | site == "BC", which would select rows where the site was either “VCR” or “BC” (a sketch of that version appears after the examples below). In the case where we have a separate vector dictating the focal sites we are interested in, we can use that vector in combination with the %in% operator instead:
base R
focal_sites <- c("ZI", "RC", "VCR", "DB", "JC")
focal_crabs <- crabs[which(crabs$site %in% focal_sites),]
Tidyverse
focal_sites <- c("ZI", "RC", "VCR", "DB", "JC")
focal_crabs <- crabs %>%
dplyr::filter(site %in% focal_sites)
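For comparison, here is roughly what the | version mentioned above could look like; it selects the same rows as filtering on the vector c("VCR", "BC"). This is a sketch, and the object name vcr_or_bc is just for illustration:
# base R: spell out each condition with |
vcr_or_bc <- crabs[which(crabs$site == "VCR" | crabs$site == "BC"), ]
# Tidyverse equivalent
vcr_or_bc <- crabs %>%
  dplyr::filter(site == "VCR" | site == "BC")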
This filtering can be extended to use multiple columns relatively simply. There is a variable in this dataset called size.
hist(crabs$size)
We may want to filter to crabs that are in our focal sites and are also above some size (e.g. 15).
base R
site_size <- crabs[which(crabs$site %in% focal_sites &
crabs$size > 15), ]
Tidyverse
site_size <- crabs %>%
dplyr::filter(site %in% focal_sites & size > 15)
We can always use multiple pipes to chain commands together, even if they use the same function. For example, we could expand the above example into two filter() commands if we wanted to:
site_size <- crabs %>%
dplyr::filter(site %in% focal_sites) %>%
dplyr::filter(size > 15)
In the single filter() version we used another logical operator, &, to link our two conditions together; in the chained version each condition gets its own filter() call. We can double check:
hist(site_size$size)
unique(site_size$site)
## [1] "ZI" "RC" "VCR" "DB" "JC"
So we can see it worked well.
Filtering out NAs
For this example, we will use a different dataset.
ice <- lterdatasampler::ntl_icecover
head(ice)
## # A tibble: 6 × 5
## lakeid ice_on ice_off ice_duration year
## <fct> <date> <date> <dbl> <dbl>
## 1 Lake Mendota NA 1853-04-05 NA 1852
## 2 Lake Mendota 1853-12-27 NA NA 1853
## 3 Lake Mendota 1855-12-18 1856-04-14 118 1855
## 4 Lake Mendota 1856-12-06 1857-05-06 151 1856
## 5 Lake Mendota 1857-11-25 1858-03-26 121 1857
## 6 Lake Mendota 1858-12-08 1859-03-14 96 1858
We can see already that some values of ice_duration are NA.
Ice Cover
When we think of climate change, we’re often shown news footage of melting glaciers or ice caps at the poles. However, reduction of ice cover is not unique to those environments; in fact, many lake ecosystems rely on ice cover during the winter. Three lakes near Madison, WI, USA have been used to monitor how climate change may be changing the duration of ice cover in temperate lakes, and these are the data we use in this example.
To filter those out of the dataset, we can use the is.na() function. This function can also be used along with the any() function to see whether any of the values in our column of interest are NA, which is useful for long dataframes where we can’t see all the values. Let’s check the ice_duration column now:
any(is.na(ice$ice_duration))
## [1] TRUE
And as we expected, it returns TRUE.
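If we also want to know how many NA values there are, not just whether any exist, one option (an aside, not part of the original example) is to sum the logical vector, since each TRUE counts as 1:
# count the NA values in ice_duration; TRUE is treated as 1 when summed
sum(is.na(ice$ice_duration))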
base R
trimmed_ice <- ice[which(!is.na(ice$ice_duration)),]
Again, we leave the column position empty so that all the columns are kept.
Tidyverse
trimmed_ice <- ice %>%
dplyr::filter(!is.na(ice_duration))
Here we use the combination of ! and is.na(): is.na() finds the positions where the value is NA, and the ! operator negates that result so those rows are removed.
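A tiny illustration of that negation step on its own, using a made-up vector (the object x is hypothetical, just for demonstration):
# is.na() flags the missing values; ! flips TRUE/FALSE so we keep the non-missing ones
x <- c(10, NA, 25)
is.na(x)    # TRUE where missing
!is.na(x)   # TRUE where present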
Any conceivable set of criteria can be strung together (carefully!) to filter a dataset with as many conditions as necessary. When you are just beginning to use these functions, however, it is always a good idea to work piece by piece, adding conditions one at a time and testing each one to make sure it’s doing what you want. Then you can string multiple conditions together without worrying that an error is going uncaught.
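As a final sketch of that piece-by-piece approach (reusing the crabs data and the arbitrary size threshold from above; the intermediate object names are just for illustration):
# step 1: filter on site only, then check the result
step1 <- crabs %>%
  dplyr::filter(site %in% focal_sites)
unique(step1$site)
# step 2: add the size condition, then check again
step2 <- step1 %>%
  dplyr::filter(size > 15)
range(step2$size)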