Making New Columns
While summarizing columns is useful if we want to pare down our dataframe and get a summary, it’s often the case we actually just want to add a new column to our dataframe, perhaps based on some set of other columns already in our dataframe. In Tidy R, we will almost always use dplyr::mutate()
for this task.
Mutating Columns
Single Values
The simplest use of mutate()
is to make a new column that is made up of either a pre-existing object or a single value. For examples here, let’s turn to a long-term dataset – sugar maple seedlings at Hubbard Brook Experimental Forest.
library(tidyverse)
library(lterdatasampler)
df = lterdatasampler::hbr_maples
head(df)
## # A tibble: 6 × 11
## year watershed elevation transect sample stem_length leaf1area
## <dbl> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 2003 Reference Low R1 1 86.9 13.8
## 2 2003 Reference Low R1 2 114 14.6
## 3 2003 Reference Low R1 3 83.5 12.5
## 4 2003 Reference Low R1 4 68.1 9.97
## 5 2003 Reference Low R1 5 72.1 6.84
## 6 2003 Reference Low R1 6 77.7 9.66
## # … with 4 more variables: leaf2area <dbl>, leaf_dry_mass <dbl>,
## # stem_dry_mass <dbl>, corrected_leaf_area <dbl>
For this example, we only will work with some of the variables, so let’s first select the columns we want (See more on selection here).
df <- df %>%
dplyr::select(year, watershed, elevation, leaf1area,
leaf2area, corrected_leaf_area)
df
## # A tibble: 359 × 6
## year watershed elevation leaf1area leaf2area corrected_leaf_a…
## <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2003 Reference Low 13.8 12.1 29.1
## 2 2003 Reference Low 14.6 15.3 33.0
## 3 2003 Reference Low 12.5 9.73 25.3
## 4 2003 Reference Low 9.97 10.1 23.2
## 5 2003 Reference Low 6.84 5.48 15.5
## 6 2003 Reference Low 9.66 7.64 20.4
## 7 2003 Reference Low 8.82 9.23 21.2
## 8 2003 Reference Low 5.83 6.18 15.2
## 9 2003 Reference Low 8.11 7.13 18.4
## 10 2003 Reference Low 3.02 3.44 9.60
## # … with 349 more rows
We could for example, want a column in this dataset that tells us what forest these data are from. That would be an easy use of the mutate()
function as there’s only one value. We could do this like so:
df <- df %>%
dplyr::mutate(forest = "HBEF")
df
## # A tibble: 359 × 7
## year watershed elevation leaf1area leaf2area corrected_leaf_a…
## <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2003 Reference Low 13.8 12.1 29.1
## 2 2003 Reference Low 14.6 15.3 33.0
## 3 2003 Reference Low 12.5 9.73 25.3
## 4 2003 Reference Low 9.97 10.1 23.2
## 5 2003 Reference Low 6.84 5.48 15.5
## 6 2003 Reference Low 9.66 7.64 20.4
## 7 2003 Reference Low 8.82 9.23 21.2
## 8 2003 Reference Low 5.83 6.18 15.2
## 9 2003 Reference Low 8.11 7.13 18.4
## 10 2003 Reference Low 3.02 3.44 9.60
## # … with 349 more rows, and 1 more variable: forest <chr>
And we see that our new column has been added onto the far right of the dataframe.
Sugar Maple
The iconic sugar maple (Acer saccharum) is a deciduous hardwood species, typically found in the Northeast United States, and Southeast Canada. These species responsible for producing maple syrup may be at risk of growth reduction due to soil acidification – a consequence of long term acid rain. The data on these sugar maples is collected from the Hubbard Brook Experimental Forest in New Hampshire, USA.
Mutating Columns as a Function of Existing Columns
A more common use case is if we want to make a new column from existing columns. Two typical uses here would be:
- Making a new column that is made via a function operating on another column(s)
- Making a factor column that binds together values from other columns
New column Via Numeric Function or Logical Test
Rowwise Functions
In these data, there are two measures of leaf area: leaf1area
and leaf2area
. While they’re both very similar, it might still be useful to have a mean estimate for each observation in the data between the two leaf area measurements.
Say we wanted to calculate the mean value of the measures of leaf area for each row. To make this operation on a row-wise basis, we can use dplyr::rowwise()
which will ensure each calculation is done on a single row. Combining this with mutate()
will produce our desired result.
df <- df %>%
dplyr::rowwise() %>% # ensures calculation done for each row
dplyr::mutate(
mean_area = mean(c(leaf1area, leaf2area), na.rm = TRUE)
)
df
## # A tibble: 359 × 8
## # Rowwise:
## year watershed elevation leaf1area leaf2area corrected_leaf_a…
## <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2003 Reference Low 13.8 12.1 29.1
## 2 2003 Reference Low 14.6 15.3 33.0
## 3 2003 Reference Low 12.5 9.73 25.3
## 4 2003 Reference Low 9.97 10.1 23.2
## 5 2003 Reference Low 6.84 5.48 15.5
## 6 2003 Reference Low 9.66 7.64 20.4
## 7 2003 Reference Low 8.82 9.23 21.2
## 8 2003 Reference Low 5.83 6.18 15.2
## 9 2003 Reference Low 8.11 7.13 18.4
## 10 2003 Reference Low 3.02 3.44 9.60
## # … with 349 more rows, and 2 more variables: forest <chr>,
## # mean_area <dbl>
Now, again, we have our new column tacked on at the end of our dataframe just as we wanted.
Since we have now a mean estimate of leaf area, we might want to compare that to our measure of corrected_leaf_area
. We might want, for instance, to see if the corrected_leaf_area
is a consistent proportion larger than the mean of the two leaf area measurements.
To go about this, we might want to get a ratio of the mean_area
and the corected_leaf_area
for each row. To do so, we can create an operation very similarly to above:
df <- df %>%
dplyr::rowwise() %>% # we still need this to get a value for each row
dplyr::mutate(
corrected_mean_ratio = mean_area / corrected_leaf_area
)
summary(df$corrected_leaf_area)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.597 21.180 25.950 26.687 31.751 55.874 119
ifelse()
A handy component of mutate()
is that we can create a new column based on a conditional statement like an ifelse()
statement (see more on this here). Briefly, in an ifelse()
statement, we have a boolean statement (i.e. TRUE/FALSE), and the function performs an action depending on whether or not the boolean is true. The structure is ifelse(test_expression, x, y)
, where x
is what will happen if the statement is TRUE, and y
is what will happen if the statmeent is FALSE. In the corrected_leaf_area
column, we may want to know if a particular value falls above or below the mean value for all of our observations. We could create a simple categorical variable based on this for each row with an ifelse()
statement.
# overall mean
mean_val <- mean(df$corrected_leaf_area, na.rm = TRUE)
# make new column
df <- df %>%
# still need to rowwise the calculation
dplyr::rowwise() %>%
dplyr::mutate(
# we specify we want this as a factor
corrected_above_below = as.factor(ifelse(
# this is the logical test
corrected_leaf_area > mean_val,
# value if test is TRUE
"Above",
# value if test is FALSE
"Below"
))
)
We look at our new column like this:
table(df$corrected_above_below)
##
## Above Below
## 111 129
Binding Columns
While this is a less common use case, it is often handy to make a new column grouping variable that is informative and can be made for each row. For example, say we wanted to have some way to group our observations by three components: the year the data were collected (2003 vs. 2004), the watershed (reference vs. w1), and elevation (low or mid). In this example, we’ll do this by pasting those elements together in a new row-wise column.
First, let’s take a look at the summaries of each of these columns:
summary(df$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2003 2003 2003 2003 2004 2004
summary(df$watershed)
## Reference W1
## 179 180
summary(df$elevation)
## Low Mid NA's
## 120 120 119
Importantly, it looks like there are some NA’s in the elevation column. For our purposes, we’ll want to keep those around as there are many observations that have NA’s and we don’t want to throw those all away.
Now, let’s perform our calculation:
df <- df %>%
dplyr::rowwise() %>%
dplyr::mutate(
# we can use paste() from base R to paste the characters together
group = as.factor(paste(year,
watershed,
elevation,
sep = "_")) # we want each component separated by a "_"
)
df
## # A tibble: 359 × 11
## # Rowwise:
## year watershed elevation leaf1area leaf2area corrected_leaf_a…
## <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 2003 Reference Low 13.8 12.1 29.1
## 2 2003 Reference Low 14.6 15.3 33.0
## 3 2003 Reference Low 12.5 9.73 25.3
## 4 2003 Reference Low 9.97 10.1 23.2
## 5 2003 Reference Low 6.84 5.48 15.5
## 6 2003 Reference Low 9.66 7.64 20.4
## 7 2003 Reference Low 8.82 9.23 21.2
## 8 2003 Reference Low 5.83 6.18 15.2
## 9 2003 Reference Low 8.11 7.13 18.4
## 10 2003 Reference Low 3.02 3.44 9.60
## # … with 349 more rows, and 5 more variables: forest <chr>,
## # mean_area <dbl>, corrected_mean_ratio <dbl>,
## # corrected_above_below <fct>, group <fct>
Now we can see how many unique combinations of these three variables there are:
table(df$group)
##
## 2003_Reference_Low 2003_Reference_Mid 2003_W1_Low
## 60 60 60
## 2003_W1_Mid 2004_Reference_NA 2004_W1_NA
## 60 59 60
We see that the observations are spread almost perfectly across the six different groups.