
We collect bat calls and analyze them, and the outputs are a nightmare. I would like to streamline our process, but I'm stumped. How do I take dozens of Excel files that look like this:

[screenshot: Excel output format]

And get the import to read every set of four rows spaced by two rows (i.e. rows 3-6, 9-12, 15-18, etc.; it's the same spacing for every project) until empty space is reached (it'll be a different number of repetitions for each project)? I also want `endCol` to stop at empty space for each four-row segment...

I can specify the range for a given site easily, using:

    library(XLConnect)
    df <- t(readWorksheetFromFile("file directory", sheet = 2, header = FALSE,
                                  startCol = 2, startRow = 3, endCol = 5, endRow = 6))

Then it gets pretty ugly, but I've done this:

    colnames(df) <- c("Species", "n", "percent", "mle")
    df <- transform(df, Species = as.character(Species),
                    n = as.numeric(as.character(n)),
                    percent = as.numeric(as.character(percent)),
                    mle = as.numeric(as.character(mle)))

The output is formatted the way I want, but I need to fix the `endRow` and `endCol` for each segment, and don't know how... any suggestions would be hugely appreciated.

NoobR
  • Just to clarify, is each chunk the same number of rows and columns? From the example it looks like yes. – Michael Davidson Aug 29 '16 at 19:28
  • Looking at the example, it seems that the number of columns varies in each chunk. So what format should the import be in? I can think of various formats, e.g. 1) a list of chunks, where each chunk is itself a list containing a file path and a data frame, and the data frame has columns for ID, N, %, and mle; or 2) a long-format data frame with cols for filepath, ID, N, %, mle – dww Aug 29 '16 at 20:17
  • I get data like this. I just read the whole thing at once; create `tab = cumsum(<regex matches "c:\\">)`; `split(DF, tab)`; and then loop over the tabs, dropping empty/irrelevant rows and cols and grabbing col names. – Frank Aug 29 '16 at 20:18
  • It's not quite ready yet, but I know the "mini-tables" approach to Excel is one that Jenny Bryan and Rich FitzJohn are trying to build [tools to deal with](https://github.com/rsheets/README). For now, reading it all in and cleaning up after the fact is probably your best bet. – alistaire Aug 29 '16 at 20:29

3 Answers


I would tackle this much like @Frank mentioned in the comments: read the whole thing in as one big file, then split it into a list based on the file path information. Once the chunks are in a list, each dataset can be cleaned up in an `lapply` loop.

I'm reading it in via `readxl::read_excel`, but you could likely read the whole file in via a function from XLConnect if you'd prefer.

library(readxl)
orig = read_excel("test.xlsx", col_names = FALSE)

The first six lines of some fake data look like:

                                                    X0    X1    X2    X3    X4    X5    X6
                                                 <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 c:\\file directory\\acoustic data\\Site 10\\20160517  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2                               identification summary  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
3                                                   ID  EPFU  LANO  <NA>   MID  <NA>  <NA>
4                                                    N    70    12  <NA>     4  <NA>  <NA>
5                                                    %    16     3  <NA>    13  <NA>  <NA>
6                                               MLE(p)  1E-3  2E-3  <NA>  <NA>  <NA>  <NA>

This original file should be split into separate tables, which can be done based on the rows that contain the file path information starting with "c:". To figure out where these are, use cumsum with grepl on the first column of the whole dataset.

groups = cumsum(grepl("c:", orig$X0))

Use this vector to split the file, saving each separate table into a list via split.

orig_list = split(orig, groups)

Now the rest of the work is cleaning up each dataset, transposing things and removing any extra rows and columns. This is also where you might pull out the Site and Date information from the file path to add to the dataset to keep things organized, which I demonstrate but isn't strictly necessary. I put this all into a function for use in lapply. Note I used readr::type_convert at one point to make sure numeric variables were converted appropriately.

clean_data = function(data) {
    # Get rid of any empty headers (missing values in first column)
    new = data[!is.na(data[,1]),]

    # Transpose
    new = t(new)

    # Put into data.frame, removing extraneous columns 1 to 2
    # and using first row as variable names
    # Convert variables to appropriate type at same time using readr::type_convert
    new2 = readr::type_convert(as.data.frame(new[-1, -(1:2)]))
    names(new2) = new[1, -(1:2)]

    # Add site and date columns by pulling info from original "c:\\..." cell
    new2$site = unlist(strsplit(new[1], "\\\\"))[4]
    new2$date = unlist(strsplit(new[1], "\\\\"))[5]

    # Remove everything after first missing ID
    new2[cumsum(is.na(new2$ID)) == 0,]
}

Now loop through all chunks and clean each one. The resulting cleaned files will be in a list. You could row bind these together into one dataset if needed.

clean_list = lapply(orig_list, clean_data)
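
If you want everything in a single dataset rather than a list, a minimal sketch of the row bind (this assumes every cleaned chunk ends up with the same column names, as in the fake data above):

# Stack the cleaned chunks into one data.frame; site/date columns keep them distinguishable
all_data = do.call(rbind, clean_list)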
aosmith
  • Thanks for the help everyone. – NoobR Aug 30 '16 at 14:04
  • Hey aosmith, this is great, but I'm struggling with the 'clean_data' function. I'm getting this error 'Error: `col_names` must be TRUE, FALSE or a character vector' – NoobR Aug 30 '16 at 15:06
  • The error sounds like it might come from the naming of the columns based on the first row of the transposed dataset. To troubleshoot the function, take the very first element of the split list, `orig_list[[1]]` and run through each step of the function on this dataset to make sure everything works (i.e., first remove empty headers, then transpose, etc.). Also make sure your real data look pretty much like my fake data. If not, the function will need tweaking to match your exact situation. – aosmith Aug 30 '16 at 15:16

Quick and dirty, but this should work if each chunk has the same dimensions, as the example implies:

library(XLConnect)

# Read the whole sheet in once
df <- readWorksheetFromFile("file directory",sheet=2, header=FALSE)

# Figure out how many chunks you have (each appears to span 7 rows)
nChunks <- floor(nrow(df)/7)

# create blank list where you can house the different chunks
l <- vector("list", length=nChunks)

# Iterate over the chunks, reading each into its own list element
# (chunk i spans rows 3 + 7*(i-1) through 6 + 7*(i-1))
for(i in 1:nChunks){
  l[[i]] <- t(readWorksheetFromFile("file directory", sheet=2, header=FALSE,
                                    startCol=2, startRow=3+7*(i-1),
                                    endCol=5, endRow=6+7*(i-1)))
}

Then you can do the same transformations, but with lapply to take advantage of the chunks being in a list.
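
For example, a sketch that wraps the transformations from the question in a function and applies it to every chunk (the column names assume the four-column layout shown in the question):

clean_chunk <- function(x){
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  colnames(x) <- c("Species", "n", "percent", "mle")
  # Everything came in as character after the transpose, so convert the numeric columns
  x$n <- as.numeric(x$n)
  x$percent <- as.numeric(x$percent)
  x$mle <- as.numeric(x$mle)
  x
}

l <- lapply(l, clean_chunk)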

The different possible column labels make this a little complicated, but since you said there are a relatively small number of sets of column names, I would just write a function that replaces them depending on the initial names:

renameCols <- function(x){
  # First possible permutation
  if(identical(colnames(x), c("nameOfColumn1","nameOfColumn2","nameOfColumn3"))) {
    colnames(x) <- c("newName1","newName2","newName3")
  }
  # Second possible permutation (placeholder names; substitute your own)
  if(identical(colnames(x), c("otherColumn1","otherColumn2","otherColumn3"))) {
    colnames(x) <- c("newName1","newName2","newName3")
  }
  # ... etc.
  return(x)
}

Then apply the column name replacement function to each chunk (each stored in a different element of the list):

l <- lapply(l, renameCols)
Michael Davidson
  • Thank you. This works except that the columns do differ, my apologies if that was unclear in the post...any suggestions there? – NoobR Aug 29 '16 at 20:37
  • Hmm.... are there a limited number of permutations or will they pretty much all be different? – Michael Davidson Aug 29 '16 at 20:47
  • There are a maximum number of species possible, so I would say it could range from 0-11....It depends on where we're sampling, but they're generally around 4-6. – NoobR Aug 29 '16 at 20:54
  • Okay, I took a crack at that. Since it's a pretty small number, you could just write a function to replace the names based on what they are initially. Good luck. – Michael Davidson Aug 30 '16 at 15:19
  • The names are still a work in progress, but everything else comes out nice and clean this way. Thanks! – NoobR Aug 30 '16 at 16:07

I know this is an old post, but I'll add my .02 anyway. I think you should use some simple VBA to get everything organized in Excel and then read a very well structured file into R. I think things like this are a lot easier to work with using Excel, compared to doing everything in R, as you can clearly see. You should always use the right tool for the job.
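
If you go that route, the R side reduces to a single read of the restructured sheet; a minimal sketch (the file name here is hypothetical):

library(readxl)
tidy <- read_excel("restructured.xlsx")  # one rectangular table, no chunk parsing needed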

ASH