0

I have five excel files with 2 sheets per file.

-file_2015: a, b

-file_2016: a, b

-file_2017: a, b

-file_2018: a, b

Both sheets a and b provide the same data over time. They vary in the number of columns because new indicators are added to newer files that do not exist in older ones.

The column names change slightly and I would like to consolidate this data by appending it together.

Examples of variable names that are similar:

-Metric Rating - Unit level, Percentage Earning 5 Credit in Third Year

-Metric Rating - 5 Credits in 3rd Year - Unit level

The other problems I am running into is the fact that the same indicators from different files have been named with the corresponding year. For example:

-Target For 2016 - Grad Rate, 6-Year

-Target For 2017 - 6-Year Grad Rate

Although they are the same data indicator, it is producing two columns when appending data.

Currently, I am using this function to read in all of the excel files and append each sheet data together as one dataframe.

df<-list.files(path = filepath_raw_data, full.names = TRUE, pattern = "*.xlsx") %>%
    #Keetping the names of the filepath as a data columns
  set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  map_df(readxl::read_excel, sheet=sheet_name, col_types = "text",  .id="file_name")

Is there any way I can use pattern matching to match similar columns names together and append the data together?

brin
  • 35
  • 1
  • 6
  • If you can come up with a bunch of regex rules to bin the columns together like `grepl("Target") & grepl("Grad Rate") & grepl("6-Year")` then it's relatively straightforward, but tedious, to rename the columns consistently using `mutate` before binding them together. If you're asking how to automatically link columns with somewhat similar names, that's much harder. – pgcudahy Aug 30 '19 at 20:04
  • Thanks! The reason I decided not to use grepl is that I have about 100+ different variables in the same situation. I am trying to figure out way to automatically link the columns with a similar name using RecordLinkage or stringdist. – brin Sep 01 '19 at 13:44

0 Answers0