2

I have a huge excel file to manipulate. I need to read colors and styles of a big number of cells and thought to speed up calculations by parallelising tasks. I'm relying on the xlsx package and its function getCellStyle to grab the style cell by cell. That package, in turn, relies on rJava. It looks like that, for some reason, tasks involving java objects can not be parallelised. Here a reproducible example:

require(xlsx)
require(writexl)
require(doParallel)
require(foreach)
require(parallel)

#We create an excel file with the iris dataset
filename <- "iris.xlsx"
write_xlsx(iris, filename)
#Read the workbook and the first (and only) sheet
wb <- loadWorkbook(filename)
sheet <- getSheets(wb)[[1]]
#With the next two rows we grab all the cells as Java objects
rows  <- getRows(sheet)
allcells <- getCells(rows)
#This works: grabbing the style
styles <- lapply(allcells, getCellStyle)
styles[[1]]
#[1] "Java-Object{org.apache.poi.xssf.usermodel.XSSFCellStyle@abd07bb0}"

#Now we try to go parallel: we create a cluster and make
#use of foreach and dopar
registerDoParallel(6)
stylePar<-foreach(i = seq_along(allcells)) %dopar% getCellStyle(allcells[[i]])
#Unfortunately, every Java object looks null
stylePar[[1]]
#[1] "Java-Object<null>"
#For the record, even mclapply returns all Java null objects
#mclapply(allcells, getCellStyle, mc.cores = 6, mc.preschedule = FALSE)

Am I missing something or it's inherently impossible to use foreach with Java objects? Consider that I'm just reading values and not setting them.

nicola
  • 24,005
  • 3
  • 35
  • 56

3 Answers3

2

As other solutions not pointing this out we have to state that it is impossible to do.
I find the same topic from 8 years ago on the R mailing list.

https://stat.ethz.ch/pipermail/r-devel/2013-November/067960.html

"It’s a limitation of the Java runtime - you cannot fork a JVM."

Other source: how to fork JVM?

So this is not a limitation of R, more of JVM.

Solution under library(future.apply) is working as any plan was activated so the base lapply was used. Should be invoked like:

library(future)
library(future.apply)
plan(multisession)

Last thing to comment is that multiprocessing is not so trivial and could be working under different paradigms. Please check out the future vignette https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html. You could find out synchronous and asynchronous processing. More than that each has own distinctions. For me is important to remember that we have multisession and multicore (multithreating) which are working by far differently.

polkas
  • 3,797
  • 1
  • 12
  • 25
0

TLDR: I think initializing the wb inside the foreach loop should solve your problem.

The parallel package in R handles parallel processing in a rather simplistic way: it launches several instances of R, copies the necessary data into each instance, runs your code and then returns your result.

The problem I suspect you are running into is that R doesn't really know how to copy rJava objects into a new instance of R.

Cliff AB
  • 1,160
  • 8
  • 15
  • 1
    Thank you, your observation makes sense. However, reading the workbook in each iteration is such an overkill that won't be compensated by the benefit of running in parallel (it's also very memory expensive). Guess I need another solution or abandon the idea of running in parallel. – nicola Jul 07 '21 at 15:16
  • This is not true, e.g. mclapply on MacOs and Linux enable multithreating not discussed multisession. https://www.rdocumentation.org/packages/parallel/versions/3.4.0/topics/mclapply – polkas Jul 09 '21 at 19:04
  • @polkas In this case you can try and see that even `mclapply` doesn't work. So either it's not entirely true that `mclapply` shares memory or there is something inherent with Java objects. – nicola Jul 12 '21 at 08:45
  • My comment was only about omitting when commenting other methods than multisession. I know that mclapply will not work here too, source of it is the same. – polkas Jul 12 '21 at 13:48
0

I'm more familiar with future and future.apply. Subbing out foreach for future.apply seems to work. The objects match the base::lapply

require(xlsx)
require(writexl)
#require(doParallel)
#require(foreach)

library(future.apply)

#We create an excel file with the iris dataset
filename <- "iris.xlsx"
write_xlsx(iris, filename)
#Read the workbook and the first (and only) sheet
wb <- loadWorkbook(filename)
sheet <- getSheets(wb)[[1]]
#With the next two rows we grab all the cells as Java objects
rows  <- getRows(sheet)
allcells <- getCells(rows)
#This works: grabbing the style
styles <- lapply(allcells, getCellStyle)
styles[[1]]
#[1] "Java-Object{org.apache.poi.xssf.usermodel.XSSFCellStyle@abd07bb0}"

# Specify how to parallize
plan(multisession)

# Apply getCellStyle to every object in allcells
stylePar <- future_lapply(allcells, getCellStyle)
stylePar[[1]]

styles[[1]] == stylePar[[1]]
#Now we try to go parallel: we create a cluster and make
#use of foreach and dopar
# registerDoParallel(6)
# stylePar<-foreach(i = seq_along(allcells)) %dopar% xlsx::getCellStyle(allcells[[i]])
# #Unfortunately, every Java object looks null
# stylePar[[1]]



Mxblsdl
  • 484
  • 6
  • 16
  • Thank you, I didn't know this package. However, I tested it in my real use case and it's actually significantly slower than the serial execution. It's a little bit counterintuitive on how to set cores/threads, so maybe tweaking with the parameters the situation might improve. I suspect that the huge overhead of creating the chunks might drop any benefit of the parallel execution. – nicola Jul 12 '21 at 08:39
  • 1
    Does this actually implement a parallel `lapply `or is it just the usual sequential? e.g. does a parallel `plan` need to be specified, like `plan(multisession)` before executing `future_lapply` – user20650 Jul 12 '21 at 10:57
  • Precisely a simply lapply was presented as any plan was activated. – polkas Jul 12 '21 at 13:46
  • Oops you're right I forgot you do need to specify the `plan(multisession)` or it defaults to `sequential`. Edited above – Mxblsdl Jul 12 '21 at 18:33
  • Now your code producing `Error in .jcall(cell, "Lorg/apache/poi/ss/usermodel/CellStyle;", "getCellStyle") : RcallMethod: attempt to call a method of a NULL object.`. I created an answer why this will not work. – polkas Jul 12 '21 at 20:46