Here is a slightly lengthy solution, with the goal of being highly flexible for any changes to the structure or value of your mappings. I recommend the other great answers already given here if your problem is a one-time-only sort of thing. I go over the benefits of this solution at the end.
First define your mappings in a table - this allows you to easily change them in the future or add new ones if necessary:
library(tidyverse)
labels = list(
B101 = "Estimates: Less than $10,000: Less than 20.0 percent",
B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
B103 = "Estimates: $10,000 to $19,999",
B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)
components = tribble(
~ id, ~ name, ~ new_name,
1, "Estimates", "e",
1, "Margins of error", "m",
2, "Less than $10,000", "i0to9",
2, "$10,000 to $19,999", "i10to19",
3, "Less than 20.0 percent", "p0to19",
3, "20.0 to 24.9 percent", "p20to24"
)
From this we can generate a regex:
component_regex = components %>%
split(.$id) %>%
# Fix dollar signs
map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
# Include a regex condition for the possibly of there being a colon
map(~ map_chr(.x, paste0, "[\\:]?")) %>%
map_chr(paste, collapse = "|") %>%
# Some components may not be present
paste0("(", ., ")?") %>%
# Spaces in between each component
paste(collapse = "[ ]?")
Here is the regex:
component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"
Now we extract the components from each label to create a data frame:
data_labels = labels %>%
map(str_match, pattern = component_regex) %>%
map(as.data.frame) %>%
reduce(bind_rows) %>%
select(-V1) %>%
map_df(str_replace, pattern = ":$", replacement = "") %>%
mutate(col_name = names(labels))
# A tibble: 4 x 4
V2 V3 V4 col_name
<chr> <chr> <chr> <chr>
1 Estimates Less than $10,000 Less than 20.0 percent B101
2 Estimates Less than $10,000 20.0 to 24.9 percent B102
3 Estimates $10,000 to $19,999 NA B103
4 Margins of error Less than $10,000 Less than 20.0 percent B104
Now we transform this table so that we could join the components
table from before and extract the new name. I'll first show the partial result so you can see what's going on:
data_labels %>%
pivot_longer(-col_name, names_to = "id") %>%
# Generate the component id
mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
inner_join(components, by = c("id", "value" = "name"))
# A tibble: 11 x 4
col_name id value new_name
<chr> <dbl> <chr> <chr>
1 B101 1 Estimates e
2 B101 2 Less than $10,000 i0to9
3 B101 3 Less than 20.0 percent p0to19
4 B102 1 Estimates e
5 B102 2 Less than $10,000 i0to9
6 B102 3 20.0 to 24.9 percent p20to24
7 B103 1 Estimates e
8 B103 2 $10,000 to $19,999 i10to19
9 B104 1 Margins of error m
10 B104 2 Less than $10,000 i0to9
11 B104 3 Less than 20.0 percent p0to19
Note that the inner_join()
makes it so that cases where there was no third component are omitted from the data. Here's how to finish it:
new_names = data_labels %>%
pivot_longer(-col_name, names_to = "id") %>%
# Generate the component id
mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
inner_join(components, by = c("id", "value" = "name")) %>%
group_by(col_name) %>%
summarise(final_name = paste(new_name[sort(id)], collapse = "_"))
# A tibble: 4 x 2
col_name final_name
<chr> <chr>
1 B101 e_i0to9_p0to19
2 B102 e_i0to9_p20to24
3 B103 e_i10to19
4 B104 m_i0to9_p0to19
We now just replace the names with the new ones:
old_names = intersect(names(df), new_names$col_name)
df %>%
rename_with(
~ new_names$final_name[which(old_names == .x)],
.cols = all_of(old_names)
)
# A tibble: 1 x 5
id e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 1 2 3 0.1
This solution may seem lengthy, but here are some benefits of it:
- The mapping can be stored in a CSV file and modified outside of the code. That is, the code does not actually depend on your mapping.
- You can add or remove parts of each component.
- It works whether any of the components are missing.
- It works with more than three components.