Currently I'm working on a little R project to read some information out of Word files. Since those are zipped XML files under the hood, I thought that this task would be quite easy with R. My script basically works, but I wanted to increase its speed, so I had a look at the doParallel and foreach packages.
library(foreach)
library(doParallel)
cores <- detectCores()
cl <- makeCluster(cores - 1)
registerDoParallel(cl)
file_list <- list.files(
  path = "/path/to/word/files",
  pattern = glob2rx("*.docx"),
  ignore.case = TRUE,
  full.names = TRUE,
  recursive = TRUE
)
final <- foreach(
  filename = file_list[1:4],
  .combine = rbind,
  .packages = c("stringr", "xml2", "tibble"),
  .verbose = TRUE,
  .inorder = FALSE
) %dopar% {
  name <- str_extract(filename, "[0-9a-f]{40}")
  # doc <- read_xml(unzip(zipfile = filename, files = c("word/document.xml")), encoding = "utf-8")
  df <- tibble(
    Name = name
  )
  df
}
stopCluster(cl)
This script works fine, but if I uncomment the line containing the read_xml statement and run the script, I get inconsistent errors such as
Error in { : task 1 failed - "Opening and ending tag mismatch: pPrPr line 2 and r [76]"
or
Error in { : task 1 failed - "Specification mandates value for attribute MERGEFORMAT [41]"
or
Error in { : task 1 failed - "Extra content at the end of the document [5]"
So using the xml2 package within a parallel environment does not seem to work. Switching from %dopar% to %do% solves the problem, but I lose the speedup.
I know that pointers generated by xml2 are not valid across different threads of R, but my idea was to read and process one .docx file per thread. Any ideas on how to solve this problem?
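
One thing that might be worth checking, stated as an assumption rather than a confirmed diagnosis: unzip() extracts into the current working directory by default (exdir = "."), so every worker would write word/document.xml to the same path. If that is the cause, one worker could be reading the file while another overwrites it, which would match the random parse errors. Below is a minimal sketch that gives each call its own temporary directory. The helper name read_docx_xml is hypothetical and not part of xml2.

```r
library(xml2)

# Hypothetical helper (not part of xml2): extract word/document.xml into a
# private temporary directory so parallel workers never share an output path.
read_docx_xml <- function(filename) {
  exdir <- tempfile(pattern = "docx_")
  dir.create(exdir)
  # Remove the extracted files once the document has been parsed.
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  # unzip() returns the path(s) of the extracted file(s).
  xml_path <- unzip(zipfile = filename, files = "word/document.xml", exdir = exdir)
  read_xml(xml_path, encoding = "UTF-8")
}
```

Inside the %dopar% body this would replace the commented-out line with doc <- read_docx_xml(filename), with the helper either defined inside the loop body or passed via .export. Any values needed from doc should still be extracted inside the loop and returned as plain R objects, since the xml2 pointer itself cannot be sent back to the main process.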