1

I have data in Excel that comes from an application that summarizes that data into different tables. The data looks good in Excel, but when I try to import it into R some columns are skipped and not aligned. I need to tidy up the data so that I can plot it.

Below is a reproducible sample.

df <- data.frame( ` `  =  c("cars","buses","","under 1yr","1-2 yrs","2-5 yrs",">5 yrs"),
              fcltA =  c("1","5","","","","","" ),
              ` `  =  c("","","fcltA","5","","","1"),
              fcltB =  c("6","","","","","",""),
              ` `  =  c("","","fcltB","3","","2","1"),
              fcltC =  c("2","2","","","","",""),
              ` ` =  c("","","fcltC","1","2","","1"),
 check.names = FALSE, fix.empty.names = FALSE)

Below is what I want

dfClnd <- data.frame( ` `  =  c("cars","buses","","under 1yr","1-2 yrs","2-5 yrs",">5 yrs"),
                  fcltA =  c("1","5"," fcltA","5","","","1" ),
                  fcltB =  c("3","3","fcltB","3","","2","1"), 
                  fcltC =  c("2","2","fcltC","1","2","","1"),
                  check.names = FALSE, fix.empty.names = FALSE)

I found this question but it does not work well for my problem because it shifts some values to the incorrect columns.

Below is a sample of how the data looks like:

enter image description here

user3357059
  • 1,122
  • 1
  • 15
  • 30
  • 1
    Is it safe to assume that this is a contrived example and that you cannot correct the underlying spreadsheet? It appears that your table is intended solely for display and not for processing, can you provide some context to make sure what you intend is a good direction. – r2evans Dec 26 '17 at 16:36
  • I added a picture of the data in Excel, columns go from A to Z but only a few columns have data. – user3357059 Dec 26 '17 at 17:55
  • I suggest that a more robust approach would involve subsetting portions of the spreadsheet instead of grabbing "too much at once". Which package are you using the get the data from the spreadsheet? – r2evans Dec 26 '17 at 17:59
  • I'm currently using the function `readxl::read_excel`. I am doing this for several reports, which each have several summary tables, for several periods. The same report in a different period may house the same data in a different column. – user3357059 Dec 26 '17 at 18:10
  • Hehe, that's fairly challenging to dynamically parse: inconsistently merged-cells, text immediately below the table. I think your approach would benefit *significantly* from being able to separate the tables (so that you don't have column names in the contents of the preceding table). – r2evans Dec 26 '17 at 18:39

1 Answers1

1

A solution using and . Notice that when creating df I set stringsAsFactors = FALSE to avoid creating factor columns. This is because the coalesce function will not work on different factor levels. df3 is the final output.

library(dplyr)
library(purrr)

# Replace "" with NA
df[df == ""] <- NA

# Get the new column names    
NewCol <- names(df)
NewCol <- NewCol[!NewCol %in% " "]

# Conduct the merge of columns
df2 <- map_dfc(NewCol, function(x){
  df_temp <- df[which(names(df) %in% x) + c(0, 1)]
  df_out <- as_data_frame(coalesce(df_temp[, 1], df_temp[, 2])) %>%
    setNames(x)
  return(df_out)
})

# Merge the first column with the new data frame
# Replace NA with ""
df3 <- bind_cols(df[, 1, drop = FALSE], df2)
df3[is.na(df3)] <- ""
df3
#             fcltA fcltB fcltC
# 1      cars     1     6     2
# 2     buses     5           2
# 3           fcltA fcltB fcltC
# 4 under 1yr     5     3     1
# 5   1-2 yrs                 2
# 6   2-5 yrs           2      
# 7    >5 yrs     1     1     1

DATA

df <- data.frame( ` `  =  c("cars","buses","","under 1yr","1-2 yrs","2-5 yrs",">5 yrs"),
                  fcltA =  c("1","5","","","","","" ),
                  ` `  =  c("","","fcltA","5","","","1"),
                  fcltB =  c("6","","","","","",""),
                  ` `  =  c("","","fcltB","3","","2","1"),
                  fcltC =  c("2","2","","","","",""),
                  ` ` =  c("","","fcltC","1","2","","1"),
                  check.names = FALSE, fix.empty.names = FALSE,
                  stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84