Special Type: Factors

Factors are R’s way of handling categorical data, where variables can only take on a limited number of distinct values, known as levels. While base R provides functions for working with factors, the forcats package, part of the Tidyverse, offers a suite of intuitive and powerful tools that make factor manipulation much more straightforward. This article explores common operations you’ll perform on factors using forcats, such as reordering, renaming, collapsing, and dropping levels.

Setup and Sample Factor

First, let’s load the tidyverse (which includes forcats). We’ll then create a sample factor representing different treatment groups in a hypothetical experiment. This will be our working example throughout the article.

library(tidyverse) # Automatically loads forcats as well

# Let's create a sample factor to work with.
# This represents different treatment groups in an experiment.
treatment_levels <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
experimental_data <- factor(sample(treatment_levels, 20, replace = TRUE), levels = treatment_levels)

# sample() draws at random, so your values will differ from those shown
# unless you fix the seed with set.seed() first.
# Initial levels (ordered as defined)
experimental_data
 [1] Placebo    MediumDose LowDose    LowDose    Placebo    MediumDose
 [7] MediumDose HighDose   HighDose   Placebo    LowDose    Control
[13] HighDose   LowDose    Control    LowDose    Control    Placebo
[19] LowDose    MediumDose
Levels: Control LowDose MediumDose HighDose Placebo

Our experimental_data factor has five levels, and their initial order is determined by how we defined treatment_levels.
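
To see why the explicit levels argument matters, here is a minimal sketch in base R. The vector doses is a hypothetical example and is not part of the experiment above. When levels is omitted, factor() sorts the unique values, which for character data means alphabetical order.

```r
# Hypothetical values, used only to illustrate level ordering.
doses <- c("Low", "High", "Medium", "Low")

# Without `levels`, factor() sorts the unique values alphabetically.
default_order <- factor(doses)
levels(default_order)
# [1] "High"   "Low"    "Medium"

# With `levels`, the order is exactly the one supplied.
explicit_order <- factor(doses, levels = c("Low", "Medium", "High"))
levels(explicit_order)
# [1] "Low"    "Medium" "High"
```

The alphabetical default is rarely the order you want for doses or other naturally ordered categories, which is why the sample factor passes treatment_levels explicitly.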

Re-ordering Levels

The order of factor levels is crucial for visualizations and statistical modeling, as it can affect the reference category in regressions or the display order in plots. fct_relevel() provides precise control over this order. You can either specify the complete new order of levels, or move specific levels to particular positions using the after argument:

  • If you list levels without after, or with after = 0 (the default), those levels move to the front in the order you list them; the remaining levels keep their relative order.
  • Using after = Inf (infinity) moves the specified levels to the end of the level order.
  • Using after = n, where n is an integer, places the specified levels immediately after the n-th of the remaining levels (those not being moved). For example, after = 1 places the specified levels after the first remaining level.
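
The three cases can be sketched on a small factor with a fixed level order, so the results do not depend on random sampling. The factor sizes is a hypothetical example introduced only for this illustration.

```r
library(forcats)

# Hypothetical factor with a fixed, known level order.
sizes <- factor(c("S", "M", "L", "XL"), levels = c("S", "M", "L", "XL"))

# Default (after = 0): the named levels go to the front, in the order given.
levels(fct_relevel(sizes, "XL", "L"))
# [1] "XL" "L"  "S"  "M"

# after = Inf: the named levels go to the end.
levels(fct_relevel(sizes, "S", "M", after = Inf))
# [1] "L"  "XL" "S"  "M"

# after = 2: the named level goes after the second remaining level.
levels(fct_relevel(sizes, "XL", after = 2))
# [1] "S"  "M"  "XL" "L"

# The position is counted among the levels that are not being moved:
# "S" is taken out first, then placed after "M", the first remaining level.
levels(fct_relevel(sizes, "S", after = 1))
# [1] "M"  "S"  "L"  "XL"
```

Only the level order changes in each call; the data values themselves stay the same.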

Specify the Full New Order

You can provide all levels in their desired new order. Let’s say we want “Control” first, then “Placebo”, followed by the dosage levels.

# Let's say we want "Control" first, then "Placebo", then the doses.
fct_relevel(experimental_data, "Control", "Placebo", "LowDose", "MediumDose", "HighDose")
levels(fct_relevel(experimental_data, "Control", "Placebo", "LowDose", "MediumDose", "HighDose"))
 [1] Placebo    MediumDose LowDose    LowDose    Placebo    MediumDose
 [7] MediumDose HighDose   HighDose   Placebo    LowDose    Control   
[13] HighDose   LowDose    Control    LowDose    Control    Placebo   
[19] LowDose    MediumDose
Levels: Control Placebo LowDose MediumDose HighDose
[1] "Control"    "Placebo"    "LowDose"    "MediumDose" "HighDose"

Move Levels to the Front

To move one or more levels to the beginning, name the level(s). Because after = 0 is the default, writing it out is optional. The remaining levels keep their relative order.

# Move "Placebo" to be the first level, keeping other orders relative.
# `after = 0` places it at the beginning; it is also the default when `after` is omitted.
fct_relevel(experimental_data, "Placebo", after = 0)
levels(fct_relevel(experimental_data, "Placebo", after = 0))
 [1] Placebo    MediumDose LowDose    LowDose    Placebo    MediumDose
 [7] MediumDose HighDose   HighDose   Placebo    LowDose    Control   
[13] HighDose   LowDose    Control    LowDose    Control    Placebo   
[19] LowDose    MediumDose
Levels: Placebo Control LowDose MediumDose HighDose
[1] "Placebo"    "Control"    "LowDose"    "MediumDose" "HighDose"

Move Levels to the End

Using after = Inf moves specified levels to the end of the factor order.

# Move "Control" to be the last level.
# `after = Inf` places it at the end.
fct_relevel(experimental_data, "Control", after = Inf)
levels(fct_relevel(experimental_data, "Control", after = Inf))
 [1] Placebo    MediumDose LowDose    LowDose    Placebo    MediumDose
 [7] MediumDose HighDose   HighDose   Placebo    LowDose    Control   
[13] HighDose   LowDose    Control    LowDose    Control    Placebo   
[19] LowDose    MediumDose
Levels: LowDose MediumDose HighDose Placebo Control
[1] "LowDose"    "MediumDose" "HighDose"   "Placebo"    "Control"

Move Levels After a Specific Position

You can use an integer with after to place levels at a specific position. The position is counted among the levels that are not being moved, so after = 1 places the specified level(s) directly after the first of the remaining levels.

# Move "HighDose" to be after the first level in the current order.
# Original order for experimental_data is Control, LowDose, MediumDose, HighDose, Placebo
# So, HighDose will be moved after "Control".
fct_relevel(experimental_data, "HighDose", after = 1)
levels(fct_relevel(experimental_data, "HighDose", after = 1))
 [1] Placebo    MediumDose LowDose    LowDose    Placebo    MediumDose
 [7] MediumDose HighDose   HighDose   Placebo    LowDose    Control   
[13] HighDose   LowDose    Control    LowDose    Control    Placebo   
[19] LowDose    MediumDose
Levels: Control HighDose LowDose MediumDose Placebo
[1] "Control"    "HighDose"   "LowDose"    "MediumDose" "Placebo"

Re-naming Factor Levels

forcats provides flexible ways to rename factor levels, either by specifying new names directly or by applying a function.

Rename Specific Levels Manually

fct_recode() allows you to change the names of specific levels. The syntax is "NewName" = "OldName".

# Let's shorten "MediumDose" to "MedDose" and "HighDose" to "HiDose"
recoded_factor <- fct_recode(experimental_data,
                             "MedDose" = "MediumDose",
                             "HiDose"  = "HighDose")
recoded_factor
 [1] Placebo MedDose LowDose LowDose Placebo MedDose MedDose HiDose  HiDose 
[10] Placebo LowDose Control HiDose  LowDose Control LowDose Control Placebo
[19] LowDose MedDose
Levels: Control LowDose MedDose HiDose Placebo
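
As a small additional sketch, fct_recode() leaves any level you do not mention unchanged, and assigning the same new name to several old levels merges them. The factor x below is a hypothetical example, not the experiment data.

```r
library(forcats)

# Hypothetical factor; levels default to alphabetical order:
# "apple" "banana" "bear" "dear"
x <- factor(c("apple", "bear", "banana", "dear"))

# Both "apple" and "banana" are renamed to "fruit", which merges them.
# "bear" and "dear" are not mentioned, so they are left as they are.
recoded <- fct_recode(x, fruit = "apple", fruit = "banana")

levels(recoded)
# [1] "fruit" "bear"  "dear"

as.character(recoded)
# [1] "fruit" "bear"  "fruit" "dear"
```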

Rename Levels Using a Function

fct_relabel() applies a function to all level names. The function you provide should take a character vector (the current levels) and return a character vector (the new levels).

Let’s prepend “Group: ” to each level:

# Let's prepend "Group: " to each level name.
# The function provided to fct_relabel takes the current level names as input.
relabeled_factor <- fct_relabel(experimental_data, ~ paste("Group:", .))
relabeled_factor
levels(relabeled_factor)
 [1] Group: Placebo    Group: MediumDose Group: LowDose    Group: LowDose   
 [5] Group: Placebo    Group: MediumDose Group: MediumDose Group: HighDose  
 [9] Group: HighDose   Group: Placebo    Group: LowDose    Group: Control   
[13] Group: HighDose   Group: LowDose    Group: Control    Group: LowDose   
[17] Group: Control    Group: Placebo    Group: LowDose    Group: MediumDose
5 Levels: Group: Control Group: LowDose Group: MediumDose ... Group: Placebo
[1] "Group: Control"    "Group: LowDose"    "Group: MediumDose"
[4] "Group: HighDose"   "Group: Placebo"

Or convert all levels to uppercase:

# Another `fct_relabel` example: Convert to uppercase
relabeled_upper_factor <- fct_relabel(experimental_data, toupper)
relabeled_upper_factor
levels(relabeled_upper_factor)
 [1] PLACEBO    MEDIUMDOSE LOWDOSE    LOWDOSE    PLACEBO    MEDIUMDOSE
 [7] MEDIUMDOSE HIGHDOSE   HIGHDOSE   PLACEBO    LOWDOSE    CONTROL   
[13] HIGHDOSE   LOWDOSE    CONTROL    LOWDOSE    CONTROL    PLACEBO   
[19] LOWDOSE    MEDIUMDOSE
Levels: CONTROL LOWDOSE MEDIUMDOSE HIGHDOSE PLACEBO
[1] "CONTROL"    "LOWDOSE"    "MEDIUMDOSE" "HIGHDOSE"   "PLACEBO"
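
The function passed to fct_relabel() can also be an ordinary anonymous function. The sketch below strips a trailing "Dose" from each level name; the factor groups and the helper strip_dose are hypothetical names used only for this illustration.

```r
library(forcats)

# Hypothetical factor with a fixed level order.
groups <- factor(c("LowDose", "Control", "HighDose"),
                 levels = c("Control", "LowDose", "HighDose"))

# Hypothetical helper: takes the current level names (a character vector)
# and returns the new names, with a trailing "Dose" removed.
strip_dose <- function(level_names) sub("Dose$", "", level_names)

shortened <- fct_relabel(groups, strip_dose)

levels(shortened)
# [1] "Control" "Low"     "High"

as.character(shortened)
# [1] "Low"     "Control" "High"
```

The level order is unchanged; only the names are rewritten.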

Collapsing Levels

fct_collapse() is used to group existing levels into new, broader categories. This is useful for simplifying factors or creating higher-level groupings.

You can define multiple new groups simultaneously. Let’s categorize “LowDose” as “Low”, combine “MediumDose” and “HighDose” into “MediumOrHigh”, and group “Control” and “Placebo” as “Inactive”.

# Let's say "LowDose" is "Low", "MediumDose" and "HighDose" are "MediumOrHigh",
# and "Control" and "Placebo" form an "Inactive" group.
collapsed_factor_multi <- fct_collapse(experimental_data,
                                       Low = "LowDose",
                                       MediumOrHigh = c("MediumDose", "HighDose"),
                                       Inactive = c("Control", "Placebo"))
collapsed_factor_multi
levels(collapsed_factor_multi) # Note: new levels follow the original level order, not the argument order
 [1] Inactive     MediumOrHigh Low          Low          Inactive    
 [6] MediumOrHigh MediumOrHigh MediumOrHigh MediumOrHigh Inactive    
[11] Low          Inactive     MediumOrHigh Low          Inactive    
[16] Low          Inactive     Inactive     Low          MediumOrHigh
Levels: Inactive Low MediumOrHigh
[1] "Inactive"     "Low"          "MediumOrHigh"

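Note that the order of the new levels (“Inactive”, “Low”, “MediumOrHigh”) follows the original factor’s level order, not the order of the arguments: each new level takes the position of the first of its old levels. “Control” is the first original level, so “Inactive” comes first even though it is listed last in the call.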
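
A minimal sketch of this ordering rule, using a hypothetical factor arms with a fixed level order. It also shows one way to set a different order afterwards with fct_relevel().

```r
library(forcats)

# Hypothetical factor with a fixed level order.
arms <- factor(
  c("Placebo", "HighDose", "Control", "LowDose"),
  levels = c("Control", "LowDose", "HighDose", "Placebo")
)

# "Active" is listed first in the call, but "Control" comes first in
# levels(arms), so "Inactive" becomes the first new level.
collapsed <- fct_collapse(arms,
                          Active   = c("LowDose", "HighDose"),
                          Inactive = c("Control", "Placebo"))
levels(collapsed)
# [1] "Inactive" "Active"

# If a specific order is needed, set it explicitly afterwards.
levels(fct_relevel(collapsed, "Active"))
# [1] "Active"   "Inactive"
```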

Dropping Unused Levels

Sometimes, after subsetting data or other operations, a factor may retain levels that are no longer present in the actual data values. fct_drop() removes these unused levels.

Let’s create a subset of our experimental_data that only contains “Control” and “LowDose” observations.

# Suppose we subset our data, and some treatments are no longer present.
subset_data <- experimental_data[experimental_data %in% c("Control", "LowDose")]
subset_data # The factor still retains all original levels

# Levels before dropping
levels(subset_data)
[1] LowDose LowDose LowDose Control LowDose Control LowDose Control LowDose
Levels: Control LowDose MediumDose HighDose Placebo
[1] "Control"    "LowDose"    "MediumDose" "HighDose"   "Placebo"

Even though subset_data only has “Control” and “LowDose”, its levels still include all original treatment groups. Now, let’s use fct_drop().

# Drop unused levels
dropped_levels_factor <- fct_drop(subset_data)
dropped_levels_factor

# Levels after dropping
levels(dropped_levels_factor)
[1] LowDose LowDose LowDose Control LowDose Control LowDose Control LowDose
Levels: Control LowDose
[1] "Control" "LowDose"

The levels “MediumDose”, “HighDose”, and “Placebo” have been dropped.

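The same holds for any factor. If a factor built from the values c("A", "B") is declared with the levels “A”, “B”, “C”, and “D”, fct_drop() leaves only “A” and “B”, because “C” and “D” are not present in the data.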
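
fct_drop() also takes an only argument that limits which unused levels may be removed. The sketch below uses a hypothetical factor f to show the difference, and compares the default behavior with base R's droplevels().

```r
library(forcats)

# Hypothetical factor: "c" and "d" are declared as levels but never occur.
f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

# By default, every unused level is dropped.
levels(fct_drop(f))
# [1] "a" "b"

# With `only`, just the named unused levels are dropped; "c" is kept.
levels(fct_drop(f, only = "d"))
# [1] "a" "b" "c"

# Base R's droplevels() gives the same levels as the default call.
levels(droplevels(f))
# [1] "a" "b"
```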

Conclusion

The forcats package simplifies working with factors in R. Together, fct_relevel(), fct_recode(), fct_relabel(), fct_collapse(), and fct_drop() cover the most common tasks: reordering, renaming, collapsing, and dropping levels. With them you can control the reference category in a regression and the display order in a plot, so your categorical data is structured the way your analysis expects.