Skip to content

Making New Columns

While summarizing columns is useful if we want to pare down our dataframe and get a summary, it’s often the case we actually just want to add a new column to our dataframe, perhaps based on some set of other columns already in our dataframe. In Tidy R, we will almost always use dplyr::mutate() for this task.

Mutating Columns

Single Values

The simplest use of mutate() is to make a new column that is made up of either a pre-existing object or a single value. For examples here, let’s turn to a long-term dataset – sugar maple seedlings at Hubbard Brook Experimental Forest.

library(tidyverse)
library(lterdatasampler)
df = lterdatasampler::hbr_maples
head(df)
## # A tibble: 6 × 11
##    year watershed elevation transect sample stem_length leaf1area
##   <dbl> <fct>     <fct>     <fct>    <fct>        <dbl>     <dbl>
## 1  2003 Reference Low       R1       1             86.9     13.8 
## 2  2003 Reference Low       R1       2            114       14.6 
## 3  2003 Reference Low       R1       3             83.5     12.5 
## 4  2003 Reference Low       R1       4             68.1      9.97
## 5  2003 Reference Low       R1       5             72.1      6.84
## 6  2003 Reference Low       R1       6             77.7      9.66
## # … with 4 more variables: leaf2area <dbl>, leaf_dry_mass <dbl>,
## #   stem_dry_mass <dbl>, corrected_leaf_area <dbl>

For this example, we only will work with some of the variables, so let’s first select the columns we want (See more on selection here).

df <- df %>%
dplyr::select(year, watershed, elevation, leaf1area,
                leaf2area, corrected_leaf_area)
df
## # A tibble: 359 × 6
##     year watershed elevation leaf1area leaf2area corrected_leaf_a…
##    <dbl> <fct>     <fct>         <dbl>     <dbl>             <dbl>
##  1  2003 Reference Low           13.8      12.1              29.1 
##  2  2003 Reference Low           14.6      15.3              33.0 
##  3  2003 Reference Low           12.5       9.73             25.3 
##  4  2003 Reference Low            9.97     10.1              23.2 
##  5  2003 Reference Low            6.84      5.48             15.5 
##  6  2003 Reference Low            9.66      7.64             20.4 
##  7  2003 Reference Low            8.82      9.23             21.2 
##  8  2003 Reference Low            5.83      6.18             15.2 
##  9  2003 Reference Low            8.11      7.13             18.4 
## 10  2003 Reference Low            3.02      3.44              9.60
## # … with 349 more rows

We could for example, want a column in this dataset that tells us what forest these data are from. That would be an easy use of the mutate() function as there’s only one value. We could do this like so:

df <- df %>%
dplyr::mutate(forest = "HBEF")
df
## # A tibble: 359 × 7
##     year watershed elevation leaf1area leaf2area corrected_leaf_a…
##    <dbl> <fct>     <fct>         <dbl>     <dbl>             <dbl>
##  1  2003 Reference Low           13.8      12.1              29.1 
##  2  2003 Reference Low           14.6      15.3              33.0 
##  3  2003 Reference Low           12.5       9.73             25.3 
##  4  2003 Reference Low            9.97     10.1              23.2 
##  5  2003 Reference Low            6.84      5.48             15.5 
##  6  2003 Reference Low            9.66      7.64             20.4 
##  7  2003 Reference Low            8.82      9.23             21.2 
##  8  2003 Reference Low            5.83      6.18             15.2 
##  9  2003 Reference Low            8.11      7.13             18.4 
## 10  2003 Reference Low            3.02      3.44              9.60
## # … with 349 more rows, and 1 more variable: forest <chr>

And we see that our new column has been added onto the far right of the dataframe.

Sugar Maple

The iconic sugar maple (Acer saccharum) is a deciduous hardwood species, typically found in the Northeast United States, and Southeast Canada. These species responsible for producing maple syrup may be at risk of growth reduction due to soil acidification – a consequence of long term acid rain. The data on these sugar maples is collected from the Hubbard Brook Experimental Forest in New Hampshire, USA.

Mutating Columns as a Function of Existing Columns

A more common use case is if we want to make a new column from existing columns. Two typical uses here would be:

  1. Making a new column that is made via a function operating on another column(s)
  2. Making a factor column that binds together values from other columns

New column Via Numeric Function or Logical Test

Rowwise Functions

In these data, there are two measures of leaf area: leaf1area and leaf2area. While they’re both very similar, it might still be useful to have a mean estimate for each observation in the data between the two leaf area measurements.

Say we wanted to calculate the mean value of the measures of leaf area for each row. To make this operation on a row-wise basis, we can use dplyr::rowwise() which will ensure each calculation is done on a single row. Combining this with mutate() will produce our desired result.

df <- df %>%
dplyr::rowwise() %>%  # ensures calculation done for each row
dplyr::mutate(
    mean_area = mean(c(leaf1area, leaf2area), na.rm = TRUE)
)
df
## # A tibble: 359 × 8
## # Rowwise: 
##     year watershed elevation leaf1area leaf2area corrected_leaf_a…
##    <dbl> <fct>     <fct>         <dbl>     <dbl>             <dbl>
##  1  2003 Reference Low           13.8      12.1              29.1 
##  2  2003 Reference Low           14.6      15.3              33.0 
##  3  2003 Reference Low           12.5       9.73             25.3 
##  4  2003 Reference Low            9.97     10.1              23.2 
##  5  2003 Reference Low            6.84      5.48             15.5 
##  6  2003 Reference Low            9.66      7.64             20.4 
##  7  2003 Reference Low            8.82      9.23             21.2 
##  8  2003 Reference Low            5.83      6.18             15.2 
##  9  2003 Reference Low            8.11      7.13             18.4 
## 10  2003 Reference Low            3.02      3.44              9.60
## # … with 349 more rows, and 2 more variables: forest <chr>,
## #   mean_area <dbl>

Now, again, we have our new column tacked on at the end of our dataframe just as we wanted.

Since we have now a mean estimate of leaf area, we might want to compare that to our measure of corrected_leaf_area. We might want, for instance, to see if the corrected_leaf_area is a consistent proportion larger than the mean of the two leaf area measurements.

To go about this, we might want to get a ratio of the mean_area and the corected_leaf_area for each row. To do so, we can create an operation very similarly to above:

df <- df %>%
    dplyr::rowwise() %>%  # we still need this to get a value for each row
    dplyr::mutate(
        corrected_mean_ratio = mean_area / corrected_leaf_area
    )
summary(df$corrected_leaf_area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   9.597  21.180  25.950  26.687  31.751  55.874     119
ifelse()

A handy component of mutate() is that we can create a new column based on a conditional statement like an ifelse() statement (see more on this here). Briefly, in an ifelse() statement, we have a boolean statement (i.e. TRUE/FALSE), and the function performs an action depending on whether or not the boolean is true. The structure is ifelse(test_expression, x, y), where x is what will happen if the statement is TRUE, and y is what will happen if the statmeent is FALSE. In the corrected_leaf_area column, we may want to know if a particular value falls above or below the mean value for all of our observations. We could create a simple categorical variable based on this for each row with an ifelse() statement.

# overall mean 
mean_val <- mean(df$corrected_leaf_area, na.rm = TRUE)

# make new column
df <- df %>%
    # still need to rowwise the calculation
    dplyr::rowwise() %>%
    dplyr::mutate(
        # we specify we want this as a factor 
        corrected_above_below = as.factor(ifelse(
        # this is the logical test
        corrected_leaf_area > mean_val,
        # value if test is TRUE
        "Above",
        # value if test is FALSE
        "Below"
        ))
    )

We look at our new column like this:

table(df$corrected_above_below)
## 
## Above Below 
##   111   129

Binding Columns

While this is a less common use case, it is often handy to make a new column grouping variable that is informative and can be made for each row. For example, say we wanted to have some way to group our observations by three components: the year the data were collected (2003 vs. 2004), the watershed (reference vs. w1), and elevation (low or mid). In this example, we’ll do this by pasting those elements together in a new row-wise column.

First, let’s take a look at the summaries of each of these columns:

summary(df$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2003    2003    2003    2003    2004    2004
summary(df$watershed)
## Reference        W1 
##       179       180
summary(df$elevation)
##  Low  Mid NA's 
##  120  120  119

Importantly, it looks like there are some NA’s in the elevation column. For our purposes, we’ll want to keep those around as there are many observations that have NA’s and we don’t want to throw those all away.

Now, let’s perform our calculation:

df <- df %>%
dplyr::rowwise() %>%
dplyr::mutate(
    # we can use paste() from base R to paste the characters together
    group = as.factor(paste(year,
                watershed,
                elevation,
                sep = "_")) # we want each component separated by a "_"
)
df
## # A tibble: 359 × 11
## # Rowwise: 
##     year watershed elevation leaf1area leaf2area corrected_leaf_a…
##    <dbl> <fct>     <fct>         <dbl>     <dbl>             <dbl>
##  1  2003 Reference Low           13.8      12.1              29.1 
##  2  2003 Reference Low           14.6      15.3              33.0 
##  3  2003 Reference Low           12.5       9.73             25.3 
##  4  2003 Reference Low            9.97     10.1              23.2 
##  5  2003 Reference Low            6.84      5.48             15.5 
##  6  2003 Reference Low            9.66      7.64             20.4 
##  7  2003 Reference Low            8.82      9.23             21.2 
##  8  2003 Reference Low            5.83      6.18             15.2 
##  9  2003 Reference Low            8.11      7.13             18.4 
## 10  2003 Reference Low            3.02      3.44              9.60
## # … with 349 more rows, and 5 more variables: forest <chr>,
## #   mean_area <dbl>, corrected_mean_ratio <dbl>,
## #   corrected_above_below <fct>, group <fct>

Now we can see how many unique combinations of these three variables there are:

table(df$group)
## 
## 2003_Reference_Low 2003_Reference_Mid        2003_W1_Low 
##                 60                 60                 60 
##        2003_W1_Mid  2004_Reference_NA         2004_W1_NA 
##                 60                 59                 60

We see that the observations are spread almost perfectly across the six different groups.