For a research project I need to extract information from a large number of PDF documents that are provided online.

In order to get the information I use the "tabulizer" package (with the packages "rJava" and "tabulizerjars" installed). With extract_tables() I have already solved the extraction itself. Because some PDF documents are large (ca. 1000 pages), I needed to increase the RAM Java is allowed to use via options(java.parameters = "-Xmx8000m"). However, since I need to repeat this process many times and reading out the PDF files takes a long time, I tried to parallelize the loop using a foreach loop with the doParallel backend.

Unfortunately I do not seem to be able to increase the RAM available to Java for the parallel sessions: I do not believe options(java.parameters = "-Xmx8000m") carries over to the workers, because I get the error "task 1 failed - java.lang.OutOfMemoryError: GC overhead limit exceeded", which I do not get with the sequential loop.

I am using a Windows machine with 8 GB RAM and 2 physical / 4 logical cores. Even machines offering more RAM (16 GB) did not seem to do the trick.
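My understanding is that each parallel worker is a separate R session with its own JVM, so the option would presumably have to be set in every worker before rJava is loaded there. What I have in mind is roughly the following sketch using clusterEvalQ() from the "parallel" package (untested on my side, and the heap size per worker is just a guess, since several JVMs would run at once):

library("parallel")
library("doParallel")

cl <- makeCluster(3)
#Set the Java heap size in each worker *before* rJava is loaded there,
#since the workers do not inherit the options of the master session
clusterEvalQ(cl, {
  options(java.parameters = "-Xmx2000m")   #hypothetical value, smaller than 8000m because three JVMs run at once
  library("rJava")
  library("tabulizer")
})
registerDoParallel(cl)

But I am not sure whether this is the right approach or whether it would make the OutOfMemoryError go away.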

Below is a short version of my code, including the sequential part, which works just fine, and the parallel part, which causes the error described above.

#General 

options(java.parameters = "-Xmx8000m")   #Must be set before rJava is loaded
library("rio")
#The 64-bit Java version needs to be installed
library("rJava")
library("tabulizerjars")
library("tabulizer")
library("foreach")
library("doParallel")

#Location of the pdf files
Urls <- c("https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000011.PDF",
          "https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000081.PDF",
          "https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000241.PDF")


#Sequential 

for (i in 1:3){

  location <- Urls[i]

  #Extracted data are assigned to the variables Bank1, Bank2, Bank3
  assign(paste0("Bank", i), extract_tables(location))
}



#Parallel (the mentioned problem arises here)

cl <- makeCluster(3)
registerDoParallel(cl)

Daten <- foreach(i = 1:3, .packages = c("rJava", "tabulizerjars", "tabulizer")) %dopar% {

  location <- Urls[i]

  #Extracted data are assigned to variables; foreach also collects the results in "Daten"
  assign(paste0("Bank", i), extract_tables(location))

}

stopCluster(cl)

I have been searching for a solution for around a week or two. Your help would be very much appreciated.

Looking forward to your answers or suggestions.

Lakue101
