
I am regularly receiving data from a source that produces a non-standard Excel format which can't be read by readxl::read_excel. Here is the github issue thread. Consequently, I have a whole directory tree containing hundreds of (almost) Excel files that I would like to read into R and combine with plyr::ldply. The files can, however, be opened just fine by XLConnect::loadWorkbook. Unfortunately, even with huge amounts of memory allocated to the Java virtual machine, it always crashes after reading a few files. I tried adding these lines to my import function:

options(java.parameters = "-Xmx16g")        # request a 16 GB Java heap
detach("package:XLConnect", unload = TRUE)  # unload and ...
library(XLConnect)                          # ... reload the package, hoping the new heap setting takes effect
xlcFreeMemory()                             # ask the JVM to free unused memory

However, I still get:

Error: OutOfMemoryError (Java): Java heap space

All I need to do is resave them in Excel and then they read in just fine with readxl::read_excel. I'm hoping I could instead resave them in batch using XLConnect and then read them in using readxl::read_excel; the batch resave I have in mind is sketched below. Unfortunately, since I'm on Linux, I can't just script Excel to resave them. Does anyone have another workaround?
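For reference, a minimal sketch of the batch resave I have in mind (the directory names and file pattern are just illustrative, and I haven't verified that it avoids the same heap error):

library(XLConnect)

in_files  <- list.files("data", pattern = "\\.xlsx?$", recursive = TRUE, full.names = TRUE)
dir.create("resaved", showWarnings = FALSE)
out_files <- file.path("resaved", basename(in_files))

for (i in seq_along(in_files)) {
    wb <- loadWorkbook(in_files[i])
    saveWorkbook(wb, out_files[i])  # rewrite the file in a standard format
    xlcFreeMemory()                 # try to keep the Java heap in check between files
}

# the resaved copies should then be readable by readxl
dat <- lapply(out_files, readxl::read_excel)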

wdkrnls
  • Maybe you could batch resave the xls using a macro - something along the lines of http://answers.microsoft.com/en-us/office/forum/office_2010-excel/how-to-i-batch-convert-hundreds-of-excel-2007/7d608c2e-67fa-48df-94b2-a5b6464d8172 – lukeA Jul 27 '15 at 13:29

1 Answer


Since you're on Linux, running an Excel macro to re-save the spreadsheets looks to be difficult.

You could start a separate R process to read each spreadsheet with XLConnect. This can be done in at least two ways:

  • Run Rscript with a script file, passing it the name of the spreadsheet. Save the data to a .RData file, and read it back in your master R process (a sketch of this follows the list).

  • Use parLapply from the parallel package, passing it a vector of spreadsheet names and a function to read the file. In this case, you don't have to save the data to disk as an intermediate step. However, you might have to do this in chunks, as the slave processes will slowly run out of memory unless you restart them.
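A minimal sketch of the first approach (the helper script name read_one.R, the single-sheet assumption, and the per-file output names are illustrative, not prescribed):

# read_one.R -- run as: Rscript read_one.R <input file> <output .RData file>
library(XLConnect)
args <- commandArgs(trailingOnly = TRUE)
wb   <- loadWorkbook(args[1])
dat  <- readWorksheet(wb, sheet = 1)  # extract the data as a plain data frame
save(dat, file = args[2])

Then, in the master R process:

files <- list.files(pattern = "xlsx$")
sheets <- lapply(files, function(f) {
    out <- paste0(f, ".RData")
    system2("Rscript", c("read_one.R", shQuote(f), shQuote(out)))  # fresh JVM per file
    e <- new.env()
    load(out, envir = e)  # read the saved data back into the master session
    e$dat
})

Because each file is handled in its own R process, the Java heap is released as soon as that process exits.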

Example of the latter:

files <- list.files(pattern="xlsx$")
filesPerChunk <- 5
clustSize <- 4  # or however many slave nodes you want
runSize <- clustSize * filesPerChunk

runs <- length(files) %/% runSize + (length(files) %% runSize != 0)  # i.e. ceiling(length(files) / runSize)

library(parallel)

sheets <- lapply(seq(runs), function(i) {
    runStart <- (i - 1) * runSize + 1
    runEnd <- min(length(files), runStart + runSize - 1)
    runFiles <- files[runStart:runEnd]

    # periodically restart and stop the cluster to deal with memory leaks
    cl <- makeCluster(clustSize)
    on.exit(stopCluster(cl))

    parLapply(cl, runFiles, function(f) {
        require(XLConnect)
        # add any other loadWorkbook() arguments you need here; note that the
        # returned workbook holds a Java reference, so to get plain data back
        # from the workers you would typically call readWorksheet() on it here
        loadWorkbook(f)
    })
})

sheets <- unlist(sheets, recursive=FALSE)  # convert list of lists to a simple list
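If each worker returns a plain data frame (for example via readWorksheet(), as noted in the comment above) rather than a workbook object, the pieces can then be combined the way the question intended. A sketch, assuming sheets is a list of data frames:

library(plyr)
combined <- ldply(sheets, identity)  # row-bind the per-file data frames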
Hong Ooi