1

I have a large data-set comprised of many very specific variables. I am looking for a way to simplify the column names using a quick method rather than manually changing over 1000 columns.


total.population.2020 <- c("1","2" )
total.population.2020.both.sexes <- c("3", "4")
total.population.2020.sexes.males.14.to.16.years <- c("7", "9")
total.income.2020 <- c("55", "40")
total.income.2020.25.to.30.years <- c("80", "90")
            
df <- data.frame(total.population.2020, total.population.2020.both.sexes, total.population.2020.sexes.males.14.to.16.years, total.income.2020, total.income.2020.25.to.30.years)

I ran the clean_names function from Janitor first because using gsub/abbreviate on the original df would wipe out the whole column name leaving it empty rather than simplifying it.


library(janitor)
df <- clean_names(df)

Then I would run gsub/abbreviate. However I am running into the problem that the abbreviations are still very long (10 characters+) and missing numbers that I would count as important identifiers (ex. Age 20 to 25). This requires me to still manually change the column names.

names(df) <- abbreviate(gsub("_", " ", names(df)))
df

Is there a simpler method? How would you approach a large data-set with long column names?

Expected Output:

Old New
total.population.2020 tp
total.population.2020.both.sexes tpb
total.population.2020.sexes.males.14.to.16.years tpm14_16
total.income.2020 ti
total.income.2020.25.to.30.years ti25_30
user4157124
  • 2,809
  • 13
  • 27
  • 42
asixet
  • 11
  • 2
  • 1
    This is better. It still might not be super-easy. You could start by using something like `gsub("(.).*\\.", "\\1", nm)` to grab only the first letter of each dot-separated element (or use `strsplit` with a dot-separator first, then use `substr` to get the first letter); then apply other special-case rules like converting "20xx" to "xx" when xx are numeric; etc. (More fussing around with regexes than I have time for right now ...) – Ben Bolker Oct 05 '21 at 23:34
  • There is another issue with that expectected. Suppose yu have a column name with `"total.population.2021". I assume your New is takingthe first 2 digits ? or last 2 digits? Also `total.income.2020.25.to.30.years` the expected is ti25_30 if there is 2021, it duplicates – akrun Oct 05 '21 at 23:34
  • @akrun There is no 2021, all the data comes from the same year (2020 for this example). I will correct the post and the 20 can be removed from new. – asixet Oct 06 '21 at 01:34
  • I think an easy way would be to create a lookup table (similar to your expected output) and change the column names by matching the column names with that data. – Ronak Shah Oct 06 '21 at 05:09
  • @RonakShah Wouldn't a lookup table mean I would need to manually create 1000+ variable names? Or would there be a faster method than manually inputting variable name by variable name in a table? – asixet Oct 07 '21 at 19:23
  • @BenBolker I think your suggestion is the solution I am going with. I am new to Regex so I have been learning and playing around with it. I have made some progress with it but still need more experience on my part. Thanks for the guidance. – asixet Oct 18 '21 at 04:14

0 Answers0