I have a huge Excel file to manipulate. I need to read the colors and styles of a large number of cells, and I thought I could speed things up by parallelising the task. I'm relying on the xlsx package and its function getCellStyle to grab the style cell by cell. That package, in turn, relies on rJava. It looks like, for some reason, tasks involving Java objects cannot be parallelised. Here is a reproducible example:
require(xlsx)
require(writexl)
require(doParallel)
require(foreach)
require(parallel)
#We create an excel file with the iris dataset
filename <- "iris.xlsx"
write_xlsx(iris, filename)
#Read the workbook and the first (and only) sheet
wb <- loadWorkbook(filename)
sheet <- getSheets(wb)[[1]]
#With the next two lines we grab all the cells as Java objects
rows <- getRows(sheet)
allcells <- getCells(rows)
#This works: grabbing the style
styles <- lapply(allcells, getCellStyle)
styles[[1]]
#[1] "Java-Object{org.apache.poi.xssf.usermodel.XSSFCellStyle@abd07bb0}"
#Now we try to go parallel: we register a backend with 6 workers
#and use foreach with %dopar%
registerDoParallel(6)
stylePar <- foreach(i = seq_along(allcells)) %dopar% getCellStyle(allcells[[i]])
#Unfortunately, every Java object looks null
stylePar[[1]]
#[1] "Java-Object<null>"
#For the record, mclapply also returns only null Java objects
#mclapply(allcells, getCellStyle, mc.cores = 6, mc.preschedule = FALSE)
Am I missing something, or is it inherently impossible to use foreach with Java objects? Note that I'm only reading values, not setting them.