How to approach long column names in R?

Question

I have a large data-set comprised of many very specific variables. I am looking for a way to simplify the column names using a quick method rather than manually changing over 1000 columns.


total.population.2020 <- c("1","2" )
total.population.2020.both.sexes <- c("3", "4")
total.population.2020.sexes.males.14.to.16.years <- c("7", "9")
total.income.2020 <- c("55", "40")
total.income.2020.25.to.30.years <- c("80", "90")
            
df <- data.frame(total.population.2020, total.population.2020.both.sexes, total.population.2020.sexes.males.14.to.16.years, total.income.2020, total.income.2020.25.to.30.years)

I ran the clean_names function from Janitor first because using gsub/abbreviate on the original df would wipe out the whole column name leaving it empty rather than simplifying it.


library(janitor)
df <- clean_names(df)

Then I would run gsub/abbreviate. However I am running into the problem that the abbreviations are still very long (10 characters+) and missing numbers that I would count as important identifiers (ex. Age 20 to 25). This requires me to still manually change the column names.

names(df) <- abbreviate(gsub("_", " ", names(df)))
df

Is there a simpler method? How would you approach a large data-set with long column names?

Expected Output:

Old	New
`total.population.2020`	tp
`total.population.2020.both.sexes`	tpb
`total.population.2020.sexes.males.14.to.16.years`	tpm14_16
`total.income.2020`	ti
`total.income.2020.25.to.30.years`	ti25_30

This is better. It still might not be super-easy. You could start by using something like `gsub("(.).*\\.", "\\1", nm)` to grab only the first letter of each dot-separated element (or use `strsplit` with a dot-separator first, then use `substr` to get the first letter); then apply other special-case rules like converting "20xx" to "xx" when xx are numeric; etc. (More fussing around with regexes than I have time for right now ...) — Ben Bolker, Oct 05 '21 at 23:34
There is another issue with that expectected. Suppose yu have a column name with `"total.population.2021". I assume your New is takingthe first 2 digits ? or last 2 digits? Also `total.income.2020.25.to.30.years` the expected is ti25_30 if there is 2021, it duplicates — akrun, Oct 05 '21 at 23:34
@akrun There is no 2021, all the data comes from the same year (2020 for this example). I will correct the post and the 20 can be removed from new. — asixet, Oct 06 '21 at 01:34
I think an easy way would be to create a lookup table (similar to your expected output) and change the column names by matching the column names with that data. — Ronak Shah, Oct 06 '21 at 05:09
@RonakShah Wouldn't a lookup table mean I would need to manually create 1000+ variable names? Or would there be a faster method than manually inputting variable name by variable name in a table? — asixet, Oct 07 '21 at 19:23
@BenBolker I think your suggestion is the solution I am going with. I am new to Regex so I have been learning and playing around with it. I have made some progress with it but still need more experience on my part. Thanks for the guidance. — asixet, Oct 18 '21 at 04:14

How to approach long column names in R?

0 Answers0