Programmatically rename dataframe columns using abbreviations derived from labels

Question

I have a labelled dataframe with arbitrary column names, and I would like to rename the columns in a non-arbitrary way using the labels.

Here is a simplified version of the dataframe:

library(labelled)
library(tidyverse)

df <- tibble(id = "a", B101 = 1, B102 = 2, B103 = 3, B104 = .1)

For the columns to be renamed, each label has two to three components (delimited by a colon + single space):

var_label(df) <-
  list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
    B102 = "Estimates: $10,000 to $19,999: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
  )

So, the label for each column may have two components (e.g., B103) or three components (e.g., B102). If the label has none of the three components (e.g., id), then the column need not be renamed.

I would like to abbreviate the components of the labels as follows:

Component 1
- "Estimates: " -> e
- "Margins of error: " -> m
Component 2
- "Less than $10,000: " or "Less than $10,000" -> i0to9
- "$10,000 to $19,999: " or "$10,000 to $19,999" -> i10to19
Component 3
- "Less than 20.0 percent" -> p0to19
- "20.0 to 24.9 percent" -> p20to24

Then, I would like rename each variable by concatenating the components, which would be separated by underscores. Obviously, it's trivial to do this by hand in an ad hoc way:

df %>% 
  rename(e_i0to9_p0to19 = B101,
         e_i10to19_p20to24 = B102,
         e_i10to19 = B103, 
         m_i0to9_p0to19 = B104)

But how can I accomplish this programmatically using tidyverse principles and packages?

The third column doesn't have component 3. In general, what should we assume about your data? That each column may or may not contain any number of the three components? Or is it just the third that might be omitted? — kybazzi, Jan 13 '22 at 22:34

Baraliuh · Answer 1 · 2022-01-13T23:19:19.583

1

Here is one way of doing it:

list(
    id = "ID",
    B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
    B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
    B103 = "Estimates: $10,000 to $19,999",
    B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
) %>% 
    stringr::str_replace_all(
        c('Estimates: ' = 'e_', "Margins of error: " = 'm', "Less than \\$10,000: " = 'i0to9',
          "\\$10,000 to \\$19,999[:]?[ ]?" = 'i10to19', "Less than 20.0 percent" = 'p0to19', 
          "20.0 to 24.9 percent" = 'p20to24')
    ) %>% 
    setNames(names(df), .) %>% 
    rename(df, .)

Output:

# A tibble: 1 x 5
  ID    e_i0to9p0to19 e_i0to9p20to24 e_i10to19 mi0to9p0to19
  <chr>         <dbl>          <dbl>     <dbl>        <dbl>
1 a                 1              2         3          0.1

edited Jan 13 '22 at 23:19

answered Jan 13 '22 at 22:44

Baraliuh

2,009
5
11

The group $10,000 to $19,999 might have a colon at the end - at least it should be assumed so even if the data provided didn't have one. – kybazzi Jan 13 '22 at 22:47
I assume nothing about things not provided by OP. – Baraliuh Jan 13 '22 at 22:48
The OP states the mapping including a comma - this is directly provided. – kybazzi Jan 13 '22 at 22:48
Then OP will not get the results he is asking for as that would be a contradiction to the expected out from the example provided. – Baraliuh Jan 13 '22 at 22:51
That seems reasonable - it could definitely be clarified by the OP. – kybazzi Jan 13 '22 at 22:52
@kybazzi was correct -- the group $10,000 to $19,999 has a colon at the end when it is in a 3-component group, but not when it is in a 2-component group. – nicholas Jan 13 '22 at 23:00
@nicholas Then your post is an oxymoron. You state `e_i10to19 = B103` but one cannot map `"Estimates: $10,000 to $19,999"` -> `e_i10to19` using `"$10,000 to $19,999: "` – Baraliuh Jan 13 '22 at 23:05
@Baraliuh Why? I think that TarJae's solution does it (with one line of extra code `%>% map(., ~str_replace_all(., ":", "")[[1]])`). – nicholas Jan 13 '22 at 23:11
@nicholas That is not how you defined the mapping. I did not say it is not possible to code. – Baraliuh Jan 13 '22 at 23:13
Ah -- I see what you mean. – nicholas Jan 13 '22 at 23:14
@nicholas I've updated my answer to be able to deal with both with `: ` and without. – Baraliuh Jan 13 '22 at 23:15

score 1 · Answer 2 · edited Jan 13 '22 at 23:12

We could modify the attributes of the dataframe directly with var_label(df) from labelled package, as it is already used.

You will get a list. Then you can iterate over the list with map. I am not very happy about repeating map(., ... but at this time I have no idea how to apply the DRY (do not repeat yourself) idea:

library(tidyverse)
library(labelled)

colnames(df) <- var_label(df) %>% 
  map(., ~str_replace(., "Estimates:", "e")[[1]]) %>% 
  map(., ~str_replace(., "Margins of error:", "m")[[1]]) %>% 
  map(., ~str_replace(., "Less than \\$10,000\\:", "i0to9")[[1]]) %>% 
  map(., ~str_replace(., "\\$10,000 to \\$19,999", "i10to19")[[1]]) %>% 
  map(., ~str_replace(., "Less than 20.0 percent", "p0to19")[[1]]) %>% 
  map(., ~str_replace(., "20.0 to 24.9 percent", "p20to24")[[1]]) %>% 
  map(., ~str_replace_all(., " ", "_")[[1]]) %>% 
  map(., ~str_replace_all(., ":", "")[[1]])

  e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
           <dbl>           <dbl>     <dbl>          <dbl>
1              1               2         3            0.1

Thanks -- this almost works, but I had to modify the question in response to @kybazzi. When this code run on the labels from the revised question, there is still a colon in one of the variables. — nicholas, Jan 13 '22 at 23:06
The code works for the revised question if we add `%>% map(., ~str_replace_all(., ":", "")[[1]])`. — nicholas, Jan 13 '22 at 23:12

kybazzi · Accepted Answer · 2022-01-13T23:40:54.750

Here is a slightly lengthy solution, with the goal of being highly flexible for any changes to the structure or value of your mappings. I recommend the other great answers already given here if your problem is a one-time-only sort of thing. I go over the benefits of this solution at the end.

First define your mappings in a table - this allows you to easily change them in the future or add new ones if necessary:

library(tidyverse)
labels = list(
  B101 = "Estimates: Less than $10,000: Less than 20.0 percent", 
  B102 = "Estimates: Less than $10,000: 20.0 to 24.9 percent",
  B103 = "Estimates: $10,000 to $19,999",
  B104 = "Margins of error: Less than $10,000: Less than 20.0 percent"
)

components = tribble(
  ~ id, ~ name, ~ new_name,
  1, "Estimates", "e",
  1, "Margins of error", "m",
  2, "Less than $10,000", "i0to9",
  2, "$10,000 to $19,999", "i10to19",
  3, "Less than 20.0 percent", "p0to19",
  3, "20.0 to 24.9 percent", "p20to24"
)

From this we can generate a regex:

component_regex = components %>%
  split(.$id) %>%
  # Fix dollar signs
  map(~ str_replace_all(.x$name, "\\$", "\\\\$")) %>%
  # Include a regex condition for the possibly of there being a colon
  map(~ map_chr(.x, paste0, "[\\:]?")) %>%
  map_chr(paste, collapse = "|") %>%
  # Some components may not be present
  paste0("(", ., ")?") %>%
  # Spaces in between each component
  paste(collapse = "[ ]?")

Here is the regex:

component_regex
#> [1] "(Estimates[\\:]?|Margins of error[\\:]?)?[ ]?(Less than \\$10,000[\\:]?|\\$10,000 to \\$19,999[\\:]?)?[ ]?(Less than 20.0 percent[\\:]?|20.0 to 24.9 percent[\\:]?)?"

Now we extract the components from each label to create a data frame:

data_labels = labels %>% 
  map(str_match, pattern = component_regex) %>%
  map(as.data.frame) %>% 
  reduce(bind_rows) %>%
  select(-V1) %>%
  map_df(str_replace, pattern = ":$", replacement = "") %>%
  mutate(col_name = names(labels))

# A tibble: 4 x 4
  V2               V3                 V4                     col_name
  <chr>            <chr>              <chr>                  <chr>   
1 Estimates        Less than $10,000  Less than 20.0 percent B101    
2 Estimates        Less than $10,000  20.0 to 24.9 percent   B102    
3 Estimates        $10,000 to $19,999 NA                     B103    
4 Margins of error Less than $10,000  Less than 20.0 percent B104

Now we transform this table so that we could join the components table from before and extract the new name. I'll first show the partial result so you can see what's going on:

data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name"))

# A tibble: 11 x 4
   col_name    id value                  new_name
   <chr>    <dbl> <chr>                  <chr>   
 1 B101         1 Estimates              e       
 2 B101         2 Less than $10,000      i0to9   
 3 B101         3 Less than 20.0 percent p0to19  
 4 B102         1 Estimates              e       
 5 B102         2 Less than $10,000      i0to9   
 6 B102         3 20.0 to 24.9 percent   p20to24 
 7 B103         1 Estimates              e       
 8 B103         2 $10,000 to $19,999     i10to19 
 9 B104         1 Margins of error       m       
10 B104         2 Less than $10,000      i0to9   
11 B104         3 Less than 20.0 percent p0to19

Note that the inner_join() makes it so that cases where there was no third component are omitted from the data. Here's how to finish it:

new_names = data_labels %>%
  pivot_longer(-col_name, names_to = "id") %>%
  # Generate the component id
  mutate(id = as.numeric(str_extract_all(id, "[0-9]+")) - 1) %>%
  inner_join(components, by = c("id", "value" = "name")) %>%
  group_by(col_name) %>%
  summarise(final_name = paste(new_name[sort(id)], collapse = "_"))

# A tibble: 4 x 2
  col_name final_name     
  <chr>    <chr>          
1 B101     e_i0to9_p0to19 
2 B102     e_i0to9_p20to24
3 B103     e_i10to19      
4 B104     m_i0to9_p0to19

We now just replace the names with the new ones:

old_names = intersect(names(df), new_names$col_name)
df %>% 
  rename_with(
    ~ new_names$final_name[which(old_names == .x)], 
    .cols = all_of(old_names)
  )
# A tibble: 1 x 5
  id    e_i0to9_p0to19 e_i0to9_p20to24 e_i10to19 m_i0to9_p0to19
  <chr>          <dbl>           <dbl>     <dbl>          <dbl>
1 a                  1               2         3            0.1

This solution may seem lengthy, but here are some benefits of it:

The mapping can be stored in a CSV file and modified outside of the code. That is, the code does not actually depend on your mapping.
You can add or remove parts of each component.
It works whether any of the components are missing.
It works with more than three components.

I love this solution, and would like to accept it as the answer. However, it is failing at the very last line of code, which yields the error "Error: Names repair functions can't return `NA` values." — nicholas, Jan 13 '22 at 23:34
In your data are there columns with none of the components? This is the one case I missed, which I'll fix now. — kybazzi, Jan 13 '22 at 23:35
Yes -- the id column has none of the components. (I revised the question to include this.) — nicholas, Jan 13 '22 at 23:36
Apologies I missed the `id` column when I was testing my solution. I've updated the very last code chunk in my answer to only replace the columns that are present in `new_names`, which only contains columns that had at least one of the components. Let me know if that fixes it (seems to work on my end). — kybazzi, Jan 13 '22 at 23:42

score 1 · Answer 4 · answered Jan 13 '22 at 23:39

df %>%
  set_names(var_label(.) %>%
  unlist() %>%
  str_replace_all(c("Estimates: " = 'e',
      "Margins of error:" = "m",
      "Less than \\$10,000:?" = "i0to9",
      "\\$10,000 to \\$19,999" ="i10to19",
      "Less than 20.0 percent" = "p0to19",
      "20.0 to 24.9 percent" = "p20to24",
      ' ' = '_')))
# A tibble: 1 x 5
  ID    ei0to9_p0to19 ei0to9_p20to24 ei10to19 m_i0to9_p0to19
  <chr>         <dbl>          <dbl>    <dbl>          <dbl>
1 a                 1              2        3            0.1

Programmatically rename dataframe columns using abbreviations derived from labels

4 Answers4