Skip to content

Selecting & Renaming Columns

When working with dataframes, you’ll often need to select specific columns to focus your analysis or create smaller subsets of your data. You might also need to rename columns for clarity, consistency, or to meet the requirements of other functions or systems. This tutorial will guide you through various methods for selecting and renaming columns in R, using both base R functions and the powerful dplyr package from the Tidyverse.

Setup: Loading Data and Libraries

First, let’s load the necessary R packages. We’ll primarily use dplyr (which is part of the tidyverse) for its intuitive data manipulation verbs, and lterdatasampler for our example dataset.

library(tidyverse)
library(lterdatasampler)

For this tutorial, we will be working with the streamchem dataset from the lterdatasampler package. This dataset contains stream chemistry data from the Luquillo Experimental Forest in Puerto Rico. Let’s load it and inspect its structure using str().

streamchem <- lterdatasampler::luq_streamchem
str(streamchem)
## tibble [317 × 22] (S3: tbl_df/tbl/data.frame)
##  $ sample_id  : chr [1:317] "QS" "QS" "QS" "QS" ...
##  $ sample_date: Date[1:317], format: "1987-01-05" "1987-01-13" ...
##  $ gage_ht    : num [1:317] 2.82 2.66 2.61 2.58 ...
##  $ temp       : num [1:317] 20 20 20 20 ...
##  $ p_h        : num [1:317] 7.22 7.34 7.12 7.19 ...
##  $ cond       : num [1:317] 48.2 49.8 50.3 50.4 ...
##  $ cl         : num [1:317] 7.3 7.5 7.5 7.3 ...
##  $ no3_n      : num [1:317] 97 114 115 117 ...
##  $ so4_s      : num [1:317] 0.52 0.73 NA NA ...
##  $ na         : num [1:317] 4.75 4.81 5.19 5.08 ...
##  $ k          : num [1:317] 0.18 0.19 0.2 0.18 ...
##  $ mg         : num [1:317] 1.5 1.58 1.66 1.64 ...
##  $ ca         : num [1:317] 2.46 6.53 2.6 2.67 ...
##  $ nh4_n      : num [1:317] 14 20 25 30 ...
##  $ po4_p      : num [1:317] NA NA NA NA ...
##  $ doc        : num [1:317] 0.72 0.74 0.69 0.62 ...
##  $ dic        : num [1:317] 3.37 3.35 4.42 3.3 ...
##  $ tdn        : num [1:317] NA NA NA NA ...
##  $ tdp        : num [1:317] NA NA NA NA ...
##  $ si_o2      : num [1:317] 11.8 12.2 12.5 12.4 ...
##  $ don        : num [1:317] NA NA NA NA ...
##  $ tss        : num [1:317] 2.3 0.72 1.05 1.05 ...

The streamchem dataframe has 317 observations and 22 variables, covering various chemical measurements, sample identifiers, and dates.

Luquillo Stream Chemistry Data

The Luquillo Experimental Forest (LEF) is a Long-Term Ecological Research (LTER) site in Puerto Rico. The streamchem data provides valuable insights into the ecological and hydrological processes of this tropical forest ecosystem by documenting various chemical parameters in its streams over time.

Selecting Columns

Selecting columns involves choosing a subset of columns from a dataframe. This can be useful for simplifying your dataset, focusing on relevant variables, or preparing data for specific analyses.

Base R

Base R provides a couple of ways to select columns, primarily using bracket notation or the subset() function.

# Selecting specific columns by name using brackets: dataframe[, c("col1", "col2")]
streamchem[, c("sample_id", "sample_date", "temp", "p_h")]

# Using the subset() function: subset(dataframe, select = c("col1", "col2"))
subset(streamchem, select = c("sample_id", "sample_date", "temp", "p_h"))

While functional, base R methods for column selection can become verbose if you are selecting many columns or if the selection logic is complex. The column names must be explicitly written out in a character vector.

Tidyverse (dplyr)

The dplyr package offers the versatile select() function, which provides a more flexible and readable syntax for column selection.

Basic Selection

# Selecting specific columns with select()
# Syntax: dataframe %>% select(col1, col2, colN)
streamchem %>% select(sample_id, sample_date, temp, p_h)

# Selecting a range of columns using the : operator
# This selects sample_id, sample_date, gage_ht, temp, and p_h
streamchem %>% select(sample_id:p_h)

Using Selection Helpers

dplyr‘s select() can be combined with “selection helpers”—functions that allow you to select columns based on patterns in their names or other properties.

# Select columns that start with "sample"
streamchem %>% select(starts_with("sample"))

# Select columns that contain "temp" (case sensitive)
streamchem %>% select(contains("temp"))

Other useful selection helpers include:

  • ends_with("suffix"): Selects columns whose names end with a specified suffix.
  • matches("regex"): Selects columns whose names match a regular expression.
  • num_range("prefix", range): Selects columns like prefix1, prefix2, …, prefixN.
  • everything(): Selects all columns (useful for reordering).
  • last_col(): Selects the last column.

Selecting Columns from a Character Vector

If you have a list of column names stored in a character vector, you can use all_of() or any_of() to select them.

cols_to_keep <- c("sample_id", "temp", "p_h", "non_existent_col")

# Using all_of(): This will produce an error if any column in the vector doesn't exist in the dataframe.
# streamchem %>% select(all_of(c("sample_id", "temp", "p_h"))) # This works
# streamchem %>% select(all_of(cols_to_keep)) # This would error because "non_existent_col" is not in streamchem

# Using any_of(): This selects only the columns from the vector that actually exist in the dataframe.
# It ignores "non_existent_col" without erroring.
streamchem %>% select(any_of(cols_to_keep))

Selecting Columns Using a Predicate Function with where()

The where() helper allows you to select columns based on a condition that you define using a “predicate function.” A predicate function is simply a function that R applies to each column, and it returns either TRUE or FALSE. If the predicate function returns TRUE for a particular column, that column is selected; otherwise, it is not. This is very powerful for selecting columns based on their properties (e.g., data type, or proportion of missing values) rather than just their names.

In dplyr, you can create these predicate functions on-the-fly using a special shorthand formula notation with the tilde ~. In this notation:

  • The ~ (tilde) indicates the start of an anonymous function (a function without a formal name).
  • The . (period) is a placeholder that represents “the current column being processed.” So, when select() iterates through each column of your dataframe, . will stand for that column’s data.
# Select all numeric columns
# Here, is.numeric is a built-in R function that checks if an object is numeric.
# select(where(is.numeric)) passes each column to is.numeric().
streamchem %>% select(where(is.numeric))

# Select all character columns
# Similarly, is.character() checks if a column is of type character.
streamchem %>% select(where(is.character))

# Select columns with fewer than 10% NA (missing) values using a predicate function
streamchem %>% select(where(~mean(is.na(.)) < 0.1))

Okay, that last one was a bit much and you probably aren’t familiar with anonymous functions yet. Let’s break it down:

  • The ~ tells R we are defining a small, anonymous function.
  • For each column in ‘streamchem’, R will substitute that column for the . (period). This is a placeholder for the current column being processed.
  • is.na(.) : For the current column (.), determine which values are NA (missing). This returns a vector of TRUE/FALSE values.
  • mean(is.na(.)) : Calculate the mean of those TRUE/FALSE values. In R, TRUE is treated as 1 and FALSE as 0. So, the mean of this logical vector gives the proportion of TRUEs, which is the proportion of NA values in the column.
  • < 0.1 : Check if this calculated proportion of NAs is less than 0.1 (i.e., less than 10%).
  • If the condition is TRUE for a column, that column is selected.

These anonymous functions are powerful features which we will encounter again down the line.

Renaming Columns

Renaming columns is another common data manipulation task. Clear and consistent column names make your data easier to understand and work with.

Base R

In base R, you typically rename columns by assigning new names to the colnames() attribute of the dataframe. This can be done by index or by matching existing names.

# Create a copy to avoid modifying the original streamchem for this example
streamchem_base_rename <- streamchem

# Renaming by index (less readable, error-prone if column order changes)
# colnames(streamchem_base_rename)[5] <- "pH_value" # p_h is the 5th column (in the original str() output)
# head(streamchem_base_rename[,1:6]) # Check the change

# Renaming by name (safer)
colnames(streamchem_base_rename)[colnames(streamchem_base_rename) == "p_h"] <- "pH_new"
colnames(streamchem_base_rename)[colnames(streamchem_base_rename) == "cond"] <- "conductivity_new"

# Display a subset of columns to see the change
streamchem_base_rename[, c("sample_id", "pH_new", "conductivity_new")]

Base R renaming can be tedious, especially for multiple columns. Using indices is risky if the column order changes, and matching names requires repeating the dataframe name.

Tidyverse (dplyr)

dplyr provides two main functions for renaming: select() (if you also want to subset columns) and rename() (if you want to keep all columns but change some names). There’s also rename_with() for applying a function to rename columns.

Renaming and Selecting with select()

You can rename columns as part of a select() operation. The syntax is NewName = old_name. Only the specified (and potentially renamed) columns will be kept.

streamchem %>% 
  select(SampleID = sample_id, EventDate = sample_date, Temperature = temp)

Renaming with rename()

The rename() function changes column names while keeping all columns. The syntax is also NewName = old_name.

streamchem %>% 
  rename(pH_value = p_h, Conductivity = cond)

Renaming Using a Named Character Vector

If you have a mapping of old names to new names in a named character vector (c(NewName1 = "oldName1", NewName2 = "oldName2")), you can use it with rename() and the “bang-bang-bang” (!!!) operator (or “unquote-splice” operator).

renaming_map <- c(
  SampleID = "sample_id",
  pH = "p_h",
  WaterTemp = "temp"
)

# The !!! operator unpacks the named vector into arguments for rename()
streamchem %>% rename(!!!renaming_map)

Renaming with a Function Using rename_with()

The rename_with() function allows you to apply a function to column names. You can apply it to all columns or a selection of columns (using selection helpers).

# Rename all columns to uppercase
# Note: the 'na' column (sodium) is renamed to 'NA' (backticked), not the missing value indicator NA.
streamchem %>% rename_with(toupper)

# Rename specific columns (e.g., those starting with "s") to uppercase
# The second argument to rename_with specifies the columns to apply the function to.
streamchem %>% rename_with(toupper, .cols = starts_with("s"))

# You can use your own function, for example, to add a prefix
# The . represents the current column name.
streamchem %>% rename_with(~paste0("luquillo_", .))

# Rename a specific set of columns (if they exist) to lowercase
# First, create a temporary dataframe with some uppercase names for demonstration
temp_df_for_rename <- streamchem %>% rename(SAMPLE_ID = sample_id, TEMP = temp, P_H = p_h)

# Apply tolower to columns specified in any_of()
temp_df_for_rename %>% 
  rename_with(tolower, .cols = any_of(c("SAMPLE_ID", "TEMP", "GAGE_HT")))

Selecting and renaming columns are fundamental data manipulation skills. While base R provides the necessary tools, dplyr functions often offer more readable, flexible, and powerful alternatives, especially when dealing with complex selection logic or batch renaming operations.