
I'm currently working on a small R project that reads information out of Word files. Since those are zipped XML files under the hood, I thought this task would be quite easy in R. My script basically works, but I wanted to speed it up, so I had a look at the doParallel and foreach packages.

library(foreach)
library(doParallel)

cores <- detectCores()
cl <- makeCluster(cores - 1)
registerDoParallel(cl)


file_list <- list.files(path = "/path/to/word/files", pattern = glob2rx("*.docx"), ignore.case = TRUE, full.names = TRUE, recursive = TRUE)


final <- foreach(
  filename = file_list[1:4], .combine = rbind, .packages = c("stringr", "xml2", "tibble"),
  .verbose = TRUE, .inorder = FALSE
) %dopar% {

  name <- str_extract(filename, "[0-9a-f]{40}")


  # doc <- read_xml(unzip(zipfile = filename,  files = c("word/document.xml")), encoding = "utf-8")


  df <- tibble(
    Name = name,
  )

  df
}

stopCluster(cl)

This script works fine, but if I uncomment the line containing the read_xml statement and run the script, I get inconsistent errors like

Error in { : task 1 failed - "Opening and ending tag mismatch: pPrPr line 2 and r [76]"

or

Error in { : task 1 failed - "Specification mandates value for attribute MERGEFORMAT [41]"

or

Error in { : task 1 failed - "Extra content at the end of the document [5]"

So using the xml2 package inside a parallel loop does not seem to work. Switching from %dopar% to %do% solves the problem, but I lose the speedup.

I know that pointers generated by xml2 are not valid across different R worker processes, but my idea was to read and process one docx file per worker. Any idea how to solve this problem?

david
    What is happening in `unzip`? Does it physically extract the file `word/document.xml` from the archive and place it in the working directory? If so, then when executed in parallel, each worker will repeatedly overwrite the existing `document.xml` with the file from its own archive while another process is still trying to read it. – MrGumble Dec 15 '20 at 12:03
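
If the comment above is right about the cause, one possible workaround is to give every task its own extraction directory, so that no two workers ever write to the same `word/document.xml`. The following is only a minimal sketch under that assumption. The helper name `read_document_xml` is hypothetical and not part of the original script, and it returns plain text rather than an xml2 object, because xml2 pointers cannot be passed back from a worker.

```r
library(xml2)

# Hypothetical helper (not in the original script): extracts
# word/document.xml into a directory that is unique to this call,
# parses it, and returns the document text as a character string.
read_document_xml <- function(filename) {
  # tempfile() yields a fresh path inside the per-process temp directory
  exdir <- tempfile(pattern = "docx_")
  dir.create(exdir)
  # Remove the extracted files once the function returns
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  # unzip() returns the path(s) of the extracted file(s)
  xml_path <- unzip(zipfile = filename, files = "word/document.xml", exdir = exdir)
  doc <- read_xml(xml_path, encoding = "UTF-8")

  # Return plain data only; the parsed document itself is an external pointer
  xml_text(doc)
}
```

Inside the `%dopar%` body this could be called as `text <- read_document_xml(filename)`, after exporting the helper to the workers (for example via the `.export` argument of `foreach`). Whether this removes the errors depends on the overwrite theory being the actual cause.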
