
I'm looking to standardize code that cleans data whose column names have changed over time. The idea is to create a dictionary, along with a function that checks whether a given dataset contains any of the names in the dictionary and replaces them with the correct name (stored in the dictionary).

In the example below, 'Sepal.Length' would be converted to 'sepal_length'.

column_dict <- tibble(
from = c('Sepal.Length', 'length_of_sepal', 'sepal.lgth'),
to = c('sepal_length', 'sepal_length', 'sepal_length')
)

iris %>%
  as_tibble %>%
  map2(., column_dict, rename)
spazznolo

1 Answer


You can pass a named vector as your dictionary to dplyr::rename(). Wrap it in any_of() so that the rename does not require every name in the dictionary to be present in the data.

library(tidyverse)

old_names <- c('Sepal.Length', 'length_of_sepal', 'sepal.lgth')
new_names <- c('sepal_length', 'sepal_length', 'sepal_length')

# create named vector as dictionary
naming_key <- setNames(object = old_names, nm = new_names)

# rename according to naming key with any_of() in case there are missing columns in data
iris %>%
  tibble() %>% 
  rename(any_of(naming_key))
#> # A tibble: 150 x 5
#>    sepal_length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

Created on 2022-02-18 by the reprex package (v2.0.1)
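To illustrate why any_of() matters here, the following is a minimal sketch that applies the same key to a dataset using a different legacy name. The `old_extract` tibble is hypothetical, made up only for this illustration; the key is written out literally but is equivalent to the `naming_key` built above.

```r
library(dplyr)
library(tibble)

# Same dictionary as above, written as a literal named vector (new = "old")
naming_key <- c(
  sepal_length = "Sepal.Length",
  sepal_length = "length_of_sepal",
  sepal_length = "sepal.lgth"
)

# Hypothetical older extract that uses a different legacy column name
old_extract <- tibble(
  length_of_sepal = c(5.1, 4.9),
  Species = c("setosa", "setosa")
)

# any_of() silently skips dictionary entries that are not in the data
renamed <- old_extract %>% rename(any_of(naming_key))
names(renamed)
#> [1] "sepal_length" "Species"

# all_of() is strict: it errors because some dictionary entries are absent
strict_result <- tryCatch(
  old_extract %>% rename(all_of(naming_key)),
  error = function(e) "error"
)
strict_result
#> [1] "error"
```

Note that this assumes each dataset contains at most one of the legacy names; if two of them were present at once, both would map to sepal_length and the rename should fail on the duplicated name.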

Dan Adams
  • Could this be generalized to a dictionary in the tibble format? I'd like to store it in a csv and call to it as needed. – spazznolo Feb 18 '22 at 05:47
  • You could just create the named vector from two columns of the `tibble`, right? – Dan Adams Feb 18 '22 at 05:51
  • E.g. `old_names <- column_dict$from` and `new_names <- column_dict$to` and then use this same code. – Dan Adams Feb 18 '22 at 05:53
  • @spazznolo just do `iris %>% rename(any_of(invoke(set_names, unname(column_dict))))` – Onyambu Feb 18 '22 at 05:55
  • For what it's worth, the tibble occupies about 2.5 times the memory of the named vector. This may become relevant if your dictionary becomes very large. `object.size(column_dict) #> 1304 bytes`, `object.size(naming_key) #> 528 bytes` – Dan Adams Feb 18 '22 at 05:57
  • or even `iris %>% rename(any_of(set_names(column_dict$from, column_dict$to)))` – Onyambu Feb 18 '22 at 05:58
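Pulling the comments together, one possible sketch of the CSV-backed workflow looks like this. The helper name `standardize_names` and the temporary file path are assumptions made for illustration, not part of the original answer; the dictionary is written to a temp file first only so the example is self-contained.

```r
library(dplyr)
library(readr)
library(tibble)

# Dictionary in tibble form, as in the question
column_dict <- tibble(
  from = c("Sepal.Length", "length_of_sepal", "sepal.lgth"),
  to   = c("sepal_length", "sepal_length", "sepal_length")
)

# Round-trip through a CSV to mimic storing the dictionary on disk
dict_path <- tempfile(fileext = ".csv")
write_csv(column_dict, dict_path)
column_dict <- read_csv(dict_path, col_types = "cc")

# Hypothetical helper: build the named vector (new = "old") from the
# dictionary and rename whichever of the old names are present
standardize_names <- function(data, dict) {
  naming_key <- setNames(dict$from, dict$to)
  rename(data, any_of(naming_key))
}

cleaned <- standardize_names(as_tibble(iris), column_dict)
names(cleaned)
```

Keeping the dictionary as a two-column file makes it easy to edit outside R, and the conversion to a named vector happens only at the point of use.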