For a research project I need to extract information from a large number of PDF documents that are provided online.
To get the information I use the "tabulizer" package (with the packages "rJava" and "tabulizerjars" installed), and extract_tables() already does what I need. Because some of the PDF documents are large (ca. 1000 pages), I had to increase the RAM that Java is allowed to use via options(java.parameters = "-Xmx8000m"). However, since I need to repeat this process many times and reading the PDF files takes a long time, I tried to parallelize the loop using a foreach loop and the doParallel backend.
Unfortunately, I do not seem to be able to increase the RAM available to Java in the parallel sessions. I suspect that options(java.parameters = "-Xmx8000m") does not take effect there, because I get the error "task 1 failed - java.lang.OutOfMemoryError: GC overhead limit exceeded", which I do not receive when using a sequential loop.
I am using a Windows machine with 8 GB of RAM, 2 physical cores and 4 logical cores. Using machines with more RAM (16 GB) did not help either.
I have provided a short version of my code below, including the sequential part, which works fine, and the parallel part, which is causing the problem.
#General
options(java.parameters = "-Xmx8000m")
library("rio")
#The 64bit Java Version needs to be installed
library("rJava")
library("tabulizerjars")
library(tabulizer)
library("foreach")
library("doParallel")
#Location of pdf files
Urls <- cbind(
  "https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000011.PDF",
  "https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000081.PDF",
  "https://www.ffiec.gov/CraAdWeb/pdf/2017/D1-100000000241.PDF"
)
#Sequential
for (i in 1:3) {
  location <- Urls[i]
  # Extracted data are assigned to variables.
  assign(paste0("Bank", i), extract_tables(location))
}
#Parallel (Mentioned Problem arises here)
cl <- makeCluster(3)
registerDoParallel(cl)
Daten <- foreach(i = 1:3, .packages = c("rJava", "tabulizerjars", "tabulizer")) %dopar% {
  location <- Urls[i]
  # Extracted data are assigned to variables.
  assign(paste0("Bank", i), extract_tables(location))
}
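A possible direction, as an untested sketch rather than a confirmed fix: the workers started by makeCluster() are separate R sessions, so an option set in the main session is not automatically set in them. The sketch assumes that the heap option has to be set in each worker before rJava is loaded there, and that the heap should be divided among the workers, since three workers with "-Xmx8000m" each would ask for more than the 8 GB the machine has. The helper heap_flag(), the wrapper extract_parallel(), the 6000 MB budget and the worker count of 2 are all hypothetical choices for illustration.

```r
library(parallel)

# Hypothetical helper (not part of any package): split a total Java heap
# budget in MB evenly across the workers and format it as an -Xmx flag.
heap_flag <- function(total_mb, n_workers) {
  sprintf("-Xmx%dm", as.integer(floor(total_mb / n_workers)))
}

n_workers <- 2                      # assumed value, chosen for illustration
cl <- makeCluster(n_workers)

# Set the option in every worker session before any Java package is loaded
# there. The 6000 MB budget is an assumed value for an 8 GB machine.
flag <- heap_flag(6000, n_workers)
invisible(clusterCall(cl, function(f) options(java.parameters = f), flag))

# Check what each worker now reports for the option.
worker_flags <- unlist(clusterEvalQ(cl, getOption("java.parameters")))
print(worker_flags)

# Hypothetical wrapper around the parallel loop. tabulizer is attached on the
# workers only when the tasks run, i.e. after the option above was set.
extract_parallel <- function(cl, urls) {
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`
  foreach::foreach(i = seq_along(urls), .packages = "tabulizer") %dopar% {
    extract_tables(urls[i])
  }
}

# Daten <- extract_parallel(cl, Urls)   # uncomment to run the extraction
stopCluster(cl)
```

If worker_flags shows the expected value on every worker but the OutOfMemoryError persists, the cause presumably lies elsewhere, for example in the total memory available to all workers together.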
I have been searching for a solution for around a week or two. Your help would be very much appreciated.
Looking forward to your answers or suggestions.
Lakue101