Special Type: Factors
Factors are R’s way of handling categorical data, where variables can only take on a limited number of distinct values, known as levels. While base R provides functions for working with factors, the forcats package, part of the Tidyverse, offers a suite of intuitive and powerful tools that make factor manipulation much more straightforward. This article explores common operations you’ll perform on factors using forcats, such as reordering, renaming, collapsing, and dropping levels.
Setup and Sample Factor
First, let’s load the tidyverse (which includes forcats). We’ll then create a sample factor representing different treatment groups in a hypothetical experiment. This will be our working example throughout the article.
library(tidyverse) # Automatically loads forcats as well
# Let's create a sample factor to work with.
# This represents different treatment groups in an experiment.
treatment_levels <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
experimental_data <- factor(sample(treatment_levels, 20, replace = TRUE), levels = treatment_levels)
# Initial levels (ordered as defined)
experimental_data
[1] HighDose LowDose LowDose Control Placebo Control
[7] MediumDose Control Control HighDose HighDose MediumDose
[13] Placebo MediumDose Control LowDose Placebo LowDose
[19] Placebo LowDose
Levels: Control LowDose MediumDose HighDose Placebo
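Our experimental_data factor has five levels, and their initial order is determined by how we defined treatment_levels. Because sample() is called without a seed, the 20 values differ from run to run (the outputs in later sections come from a different draw than the one printed above); only the level order is fixed.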
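To make the random draw repeatable and to check what the factor contains, something like the following sketch could be used. The seed value 42 and the name seeded_data are arbitrary choices for illustration, not part of the original example, and only forcats is loaded here rather than the whole tidyverse.

```r
library(forcats)

# set.seed(42) is an arbitrary choice; any fixed seed makes the draw repeatable.
set.seed(42)
treatment_levels <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")

# `seeded_data` is a hypothetical name, used so the article's
# experimental_data is not overwritten.
seeded_data <- factor(
  sample(treatment_levels, 20, replace = TRUE),
  levels = treatment_levels
)

levels(seeded_data)     # level names, in their defined order
nlevels(seeded_data)    # number of levels
fct_count(seeded_data)  # tibble with one row per level and its count `n`
```

fct_count() lists every level, including any that happen to have a count of zero in a particular draw.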
Re-ordering Levels
The order of factor levels matters for visualizations and statistical modeling, because it determines the reference category in regressions and the display order in plots. fct_relevel() provides precise control over this order. You can either specify the complete new order of levels, or move specific levels to particular positions using the after argument:
- If you list levels without after, or with after = 0 (the default), those levels are moved to the front in the order you list them.
- Using after = Inf (infinity) moves the specified levels to the end of the factor order.
- Using after = n, where n is an integer, inserts the specified levels immediately after the n-th of the remaining levels, that is, the levels not being moved. For example, after = 1 places the specified levels after the first remaining level.
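As a small illustration of why level order affects the reference category, the sketch below builds a design matrix with base R’s model.matrix() before and after releveling. It assumes R’s default treatment contrasts; the names dose and dose_high_first are made up for this example.

```r
library(forcats)

# A tiny, deterministic factor (hypothetical data for illustration only).
dose <- factor(
  c("Control", "LowDose", "HighDose", "Control"),
  levels = c("Control", "LowDose", "HighDose")
)

# With default treatment contrasts, the first level is the baseline
# and gets no dummy column of its own.
colnames(model.matrix(~ dose))
# "(Intercept)" "doseLowDose" "doseHighDose"

# After moving "HighDose" to the front, it becomes the baseline instead.
dose_high_first <- fct_relevel(dose, "HighDose")
colnames(model.matrix(~ dose_high_first))
# "(Intercept)" "dose_high_firstControl" "dose_high_firstLowDose"
```

The data values are unchanged by the relevel; only the level that serves as the baseline differs.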
Specify the Full New Order
You can provide all levels in their desired new order. Let’s say we want “Control” first, then “Placebo”, followed by the dosage levels.
# Let's say we want "Control" first, then "Placebo", then the doses.
fct_relevel(experimental_data, "Control", "Placebo", "LowDose", "MediumDose", "HighDose")
levels(fct_relevel(experimental_data, "Control", "Placebo", "LowDose", "MediumDose", "HighDose"))
[1] Placebo MediumDose LowDose LowDose Placebo MediumDose
[7] MediumDose HighDose HighDose Placebo LowDose Control
[13] HighDose LowDose Control LowDose Control Placebo
[19] LowDose MediumDose
Levels: Control Placebo LowDose MediumDose HighDose
[1] "Control" "Placebo" "LowDose" "MediumDose" "HighDose"
Move Levels to the Front
To move one or more levels to the beginning, specify the level(s) and use after = 0. Because 0 is the default, after can also be omitted. The remaining levels will keep their relative order.
# Move "Placebo" to be the first level, keeping other orders relative.
# `after = 0` places it at the beginning. This is also the default if `after` is omitted and levels are specified.
fct_relevel(experimental_data, "Placebo", after = 0)
levels(fct_relevel(experimental_data, "Placebo", after = 0))
[1] Placebo MediumDose LowDose LowDose Placebo MediumDose
[7] MediumDose HighDose HighDose Placebo LowDose Control
[13] HighDose LowDose Control LowDose Control Placebo
[19] LowDose MediumDose
Levels: Placebo Control LowDose MediumDose HighDose
[1] "Placebo" "Control" "LowDose" "MediumDose" "HighDose"
Move Levels to the End
Using after = Inf moves the specified levels to the end of the factor order.
# Move "Control" to be the last level.
# `after = Inf` places it at the end.
fct_relevel(experimental_data, "Control", after = Inf)
levels(fct_relevel(experimental_data, "Control", after = Inf))
[1] Placebo MediumDose LowDose LowDose Placebo MediumDose
[7] MediumDose HighDose HighDose Placebo LowDose Control
[13] HighDose LowDose Control LowDose Control Placebo
[19] LowDose MediumDose
Levels: LowDose MediumDose HighDose Placebo Control
[1] "LowDose" "MediumDose" "HighDose" "Placebo" "Control"
Move Levels After a Specific Position
You can use an integer with after to place levels after a specific position. For instance, after = 1 places the specified level(s) after the first of the remaining levels, that is, the first level once the moved level(s) have been set aside.
# Move "HighDose" to be after the first level in the current order.
# Original order for experimental_data is Control, LowDose, MediumDose, HighDose, Placebo
# So, HighDose will be moved after "Control".
fct_relevel(experimental_data, "HighDose", after = 1)
levels(fct_relevel(experimental_data, "HighDose", after = 1))
[1] Placebo MediumDose LowDose LowDose Placebo MediumDose
[7] MediumDose HighDose HighDose Placebo LowDose Control
[13] HighDose LowDose Control LowDose Control Placebo
[19] LowDose MediumDose
Levels: Control HighDose LowDose MediumDose Placebo
[1] "Control" "HighDose" "LowDose" "MediumDose" "Placebo"
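One detail worth checking by hand is that the position given to after is counted among the levels that are not being moved. The sketch below uses a deterministic factor with one value per level (the names doses and f are hypothetical) to show this for a single level and for two levels moved together.

```r
library(forcats)

doses <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
f <- factor(doses, levels = doses)

# Moving the first level itself: "Control" is set aside, then inserted
# after the first remaining level ("LowDose").
levels(fct_relevel(f, "Control", after = 1))
# "LowDose" "Control" "MediumDose" "HighDose" "Placebo"

# Moving two levels at once: they land together, in the order listed,
# after the second remaining level ("MediumDose").
levels(fct_relevel(f, "Placebo", "Control", after = 2))
# "LowDose" "MediumDose" "Placebo" "Control" "HighDose"
```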
Re-naming Factor Levels
forcats provides flexible ways to rename factor levels, either by specifying new names directly or by applying a function.
Rename Specific Levels Manually
fct_recode() allows you to change the names of specific levels. The syntax is "NewName" = "OldName".
# Let's shorten "MediumDose" to "MedDose" and "HighDose" to "HiDose"
recoded_factor <- fct_recode(experimental_data,
  "MedDose" = "MediumDose",
  "HiDose" = "HighDose")
recoded_factor
[1] Placebo MedDose LowDose LowDose Placebo MedDose MedDose HiDose HiDose
[10] Placebo LowDose Control HiDose LowDose Control LowDose Control Placebo
[19] LowDose MedDose
Levels: Control LowDose MedDose HiDose Placebo
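fct_recode() can also merge levels when several old names are mapped to the same new name. The following is a minimal sketch on a deterministic factor; the names doses, f, and any_dose, and the label "Dose", are made up for illustration.

```r
library(forcats)

doses <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
f <- factor(doses, levels = doses)

# Mapping three old levels to one new name merges them into a single level.
any_dose <- fct_recode(f,
  "Dose" = "LowDose",
  "Dose" = "MediumDose",
  "Dose" = "HighDose"
)

levels(any_dose)
# "Control" "Dose" "Placebo"
```

Levels that are not mentioned ("Control" and "Placebo" here) are left as they are.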
Rename Levels Using a Function
fct_relabel() applies a function to all level names. The function you provide should take a character vector (the current levels) and return a character vector (the new levels).
Let’s prepend “Group: ” to each level:
# Let's prepend "Group: " to each level name.
# The function provided to fct_relabel takes the current level names as input.
relabeled_factor <- fct_relabel(experimental_data, ~ paste("Group:", .))
relabeled_factor
levels(relabeled_factor)
[1] Group: Placebo Group: MediumDose Group: LowDose Group: LowDose
[5] Group: Placebo Group: MediumDose Group: MediumDose Group: HighDose
[9] Group: HighDose Group: Placebo Group: LowDose Group: Control
[13] Group: HighDose Group: LowDose Group: Control Group: LowDose
[17] Group: Control Group: Placebo Group: LowDose Group: MediumDose
5 Levels: Group: Control Group: LowDose Group: MediumDose ... Group: Placebo
[1] "Group: Control" "Group: LowDose" "Group: MediumDose"
[4] "Group: HighDose" "Group: Placebo"
Or convert all levels to uppercase:
# Another `fct_relabel` example: Convert to uppercase
relabeled_upper_factor <- fct_relabel(experimental_data, toupper)
relabeled_upper_factor
levels(relabeled_upper_factor)
[1] PLACEBO MEDIUMDOSE LOWDOSE LOWDOSE PLACEBO MEDIUMDOSE
[7] MEDIUMDOSE HIGHDOSE HIGHDOSE PLACEBO LOWDOSE CONTROL
[13] HIGHDOSE LOWDOSE CONTROL LOWDOSE CONTROL PLACEBO
[19] LOWDOSE MEDIUMDOSE
Levels: CONTROL LOWDOSE MEDIUMDOSE HIGHDOSE PLACEBO
[1] "CONTROL" "LOWDOSE" "MEDIUMDOSE" "HIGHDOSE" "PLACEBO"
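Any function from character vector to character vector works, including an ordinary anonymous function. As a sketch, the example below strips a trailing "Dose" from each level name with base R’s sub(); the names doses, f, and short_names are hypothetical.

```r
library(forcats)

doses <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
f <- factor(doses, levels = doses)

# Remove a trailing "Dose" from every level name; names without it are unchanged.
short_names <- fct_relabel(f, function(x) sub("Dose$", "", x))

levels(short_names)
# "Control" "Low" "Medium" "High" "Placebo"
```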
Collapsing Levels
fct_collapse() is used to group existing levels into new, broader categories. This is useful for simplifying factors or creating higher-level groupings.
You can define multiple new groups simultaneously. Let’s categorize “LowDose” as “Low”, combine “MediumDose” and “HighDose” into “MediumOrHigh”, and group “Control” and “Placebo” as “Inactive”.
# Let's say "LowDose" is "Low", "MediumDose" and "HighDose" are "MediumOrHigh",
# and "Control" and "Placebo" form an "Inactive" group.
collapsed_factor_multi <- fct_collapse(experimental_data,
  Low = "LowDose",
  MediumOrHigh = c("MediumDose", "HighDose"),
  Inactive = c("Control", "Placebo"))
collapsed_factor_multi
levels(collapsed_factor_multi) # Note: new levels follow the original level order of the old levels they replace
[1] Inactive MediumOrHigh Low Low Inactive
[6] MediumOrHigh MediumOrHigh MediumOrHigh MediumOrHigh Inactive
[11] Low Inactive MediumOrHigh Low Inactive
[16] Low Inactive Inactive Low MediumOrHigh
Levels: Inactive Low MediumOrHigh
[1] "Inactive" "Low" "MediumOrHigh"
Note that the order of the new levels (“Inactive”, “Low”, “MediumOrHigh”) is determined by the first appearance of their constituent old levels in the original factor’s level order.
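When only some groups need a name, the remaining levels can be lumped together rather than listed one by one. The sketch below assumes a forcats version in which fct_collapse() accepts the other_level argument; the names doses, f, and dosed_vs_rest and the labels "Dosed" and "Other" are chosen only for illustration.

```r
library(forcats)

doses <- c("Control", "LowDose", "MediumDose", "HighDose", "Placebo")
f <- factor(doses, levels = doses)

# Name one group explicitly and send every unlisted level to "Other".
dosed_vs_rest <- fct_collapse(f,
  Dosed = c("LowDose", "MediumDose", "HighDose"),
  other_level = "Other"
)

# Cross-tabulate old against new levels to confirm the mapping.
table(original = f, collapsed = dosed_vs_rest)
```

A cross-tabulation like this is a quick way to verify that each old level ended up in the intended group.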
Dropping Unused Levels
Sometimes, after subsetting data or other operations, a factor may retain levels that are no longer present in the actual data values. fct_drop() removes these unused levels.
Let’s create a subset of our experimental_data that only contains “Control” and “LowDose” observations.
# Suppose we subset our data, and some treatments are no longer present.
subset_data <- experimental_data[experimental_data %in% c("Control", "LowDose")]
subset_data # The factor still retains all original levels
# Levels before dropping
levels(subset_data)
[1] LowDose LowDose LowDose Control LowDose Control LowDose Control LowDose
Levels: Control LowDose MediumDose HighDose Placebo
[1] "Control" "LowDose" "MediumDose" "HighDose" "Placebo"
Even though subset_data only has “Control” and “LowDose”, its levels still include all original treatment groups. Now, let’s use fct_drop().
# Drop unused levels
dropped_levels_factor <- fct_drop(subset_data)
dropped_levels_factor
# Levels after dropping
levels(dropped_levels_factor)
[1] LowDose LowDose LowDose Control LowDose Control LowDose Control LowDose
Levels: Control LowDose
[1] "Control" "LowDose"
The levels “MediumDose”, “HighDose”, and “Placebo” have been dropped.
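The same behavior shows up in a minimal case. A factor that holds only the values c("A", "B") but was declared with levels “A” through “D” reports four levels before fct_drop() and two afterwards:
[1] "A" "B" "C" "D"
[1] "A" "B"
Levels “C” and “D” are dropped because no value in the data uses them.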
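For reference, a factor with levels “A” through “D” and data c("A", "B") can be built and trimmed as in the sketch below. The variable name f is hypothetical, and the last two calls go slightly beyond the text: fct_drop() has an only argument that restricts which unused levels may be removed, and base R’s droplevels() gives the same levels as an unrestricted fct_drop().

```r
library(forcats)

# Data uses only "A" and "B", but four levels are declared.
f <- factor(c("A", "B"), levels = c("A", "B", "C", "D"))

levels(f)                        # "A" "B" "C" "D"
levels(fct_drop(f))              # "A" "B"

# Restrict dropping to "C"; the unused level "D" is kept.
levels(fct_drop(f, only = "C"))  # "A" "B" "D"

# Base R equivalent of the unrestricted drop.
identical(levels(droplevels(f)), levels(fct_drop(f)))
```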
Conclusion
The forcats package significantly simplifies working with factors in R. Functions like fct_relevel(), fct_recode(), fct_relabel(), fct_collapse(), and fct_drop() provide a cohesive and powerful toolkit for common factor manipulation tasks. By mastering these functions, you can ensure your categorical data is correctly structured for analysis, visualization, and modeling, leading to more robust and interpretable results.