
I've been working on a function to handle a large corpus. In it I use the doParallel package. Everything was working fine on 50k - 100k documents. I tested on 1M documents and received the mcfork error shown below.

However, when I go back down to a corpus size I was working on previously, I still get the same error. I even tried going as low as 1k documents. The error is generated as soon as I hit enter when calling the function in the console.

Though I have 15 cores, I tested this by going as low as just two cores - same issue.

I also tried restarting my session and clearing the environment with rm(list = ls()).

Code:

clean_corpus <- function(corpus, n = 1000) { # n is the length of each piece for parallel processing

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # save memory

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile() 
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # save memory

  # doparallel
  registerDoParallel(cores = 14)
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    # update spelling
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # spelling update based on lut
    piece <- tm_map(piece, function(i) stringi_spelling_update(i, spellingdoc))
    # regular transformations
    piece <- tm_map(piece, removeNumbers)
    piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = T)
    piece <- tm_map(piece, content_transformer(function(x, ...) 
      qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = F)))
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1) # hack to get dopar to forget the piece to save memory since now saved to rds
  }

  # combine the pieces back into one corpus
  corpus <- list()
  corpus <- foreach(i = seq_len(lenp)) %do% {
    corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
  }
  corpus <- do.call(function(...) c(..., recursive = TRUE), corpus)
  return(corpus)

} # end clean_corpus function

Then when I run it, even on a small corpus:

> mini_cleancorp <- clean_corpus(mini_corpus, n = 1000) # mini_corpus is a 10k corpus
 Error in mcfork() : 
  unable to fork, possible reason: Cannot allocate memory 

Here are some screenshots of top in the terminal just before I try to run the function:

[screenshots of top output]

Doug Fir
  • I would always create the cluster explicitly and close it after use. You could try using `stopImplicitCluster`. – Roland Aug 24 '17 at 08:59
  • Thanks for the tip. Would the appropriate place to add that in the function be right after the closing `}` of the dopar block? – Doug Fir Aug 24 '17 at 09:00
  • Yes. However, your problem could also be too many open file connections. I really don't get why you export to file and import again within the same function call. Is that for memory reasons? Can't you use `foreach`'s `.combine` parameter? – Roland Aug 24 '17 at 09:04
  • Yes, memory issues. I've been really trying hard to beat memory limitations, which is why I'm doing that. Yes, I tried .combine but hit memory limits. Saving each iteration into a temporary RDS file and then deleting the storage for the iteration (return(1)) did seem to get the job done, albeit perhaps slower than otherwise. – Doug Fir Aug 24 '17 at 09:05
  • I don't think that strategy is really worthwhile. You should probably simply reduce the number of cores instead. – Roland Aug 24 '17 at 09:07
  • I might yet try that. The frustrating thing is that this was all working fine a few hours ago! – Doug Fir Aug 24 '17 at 09:10
  • Actually, could you expand on why? Not questioning you, just new to parallel processing and want to understand the trade-off. Is it a time thing or a simplicity thing? – Doug Fir Aug 24 '17 at 09:11
  • Too many file connections can cause problems, in particular if you serialize large objects. Since you combine all results in a list, the only memory you save with your approach is the amount the workers need during the iterations. Try finding out how memory usage scales with number of workers and then find a balance between speed and memory usage. – Roland Aug 24 '17 at 09:14
  • Initially I experimented with 2 - 4 cores only but when leaving the script to run it took a few days and I would get errors I didn't understand in the shell like "broken pipe". This is why I was trying to make it faster using more cores. – Doug Fir Aug 24 '17 at 09:17
  • You have errors running with few cores and try to solve this by throwing more cores at it? Umm, no. Try understanding the errors first. Anyway, benchmark memory usage and speed with increasing numbers of cores (you should always do that for non-trivial tasks). – Roland Aug 24 '17 at 09:20
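A minimal sketch of the explicit cluster setup Roland suggests in the comments above, assuming the same tmpfile and lenp as in the question; clean_one_piece is a hypothetical wrapper for the per-piece tm_map transformations, not a real function:

library(doParallel)

cl <- makeCluster(4)                # start small; benchmark memory use per worker before scaling up
registerDoParallel(cl)

res <- foreach(i = seq_len(lenp), .packages = c("tm", "qdap")) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- clean_one_piece(piece)   # hypothetical wrapper for the tm_map calls in the question
  saveRDS(piece, paste0(tmpfile, i, ".rds"))
  1                                 # return a small value so the collected results stay tiny
}

stopCluster(cl)                     # close the workers and their connections
# (if you keep registerDoParallel(cores = n) instead, stopImplicitCluster() does the cleanup)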

1 Answer


When you use registerDoParallel(cores) on a Unix system, you end up with workers that are forked processes of the main R session. That this is the case is also confirmed by the "mcfork()" in the error message.

Now, with forked parallel processing, the workers "share" the memory of whatever is already in the main R session. This works to your advantage. However, any new object that is not in the main R session at the time of forking (i.e. when you call foreach()) will allocate new memory in the worker and therefore add to the overall memory consumption. This also applies to loaded packages.

For instance, in your first foreach() loop you call qdap::rm_stopwords() and tm::stopwords(). This means that, if packages qdap and tm are not loaded in the main R session, each of the 14 forked processes will load them (and their dependencies) independently and thereby occupy 14 times the memory needed by those packages. So, in a fresh R session, compare the overall memory usage with and without:

 loadNamespace("qdap")
 loadNamespace("tm")

I did a very rough check and it looks like qdap and its dependencies consume about 3 GiB of RAM. Loading them independently in 14 workers (= cores) would therefore consume 42 GiB of RAM. If you load them prior to calling foreach(), your overall memory consumption should remain around 3 GiB.
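As a minimal sketch of that load-before-fork idea, reusing tmpfile and lenp from the question's function and showing only one of the transformations for brevity:

loadNamespace("qdap")   # load the heavy packages in the parent session first ...
loadNamespace("tm")     # ... so the forked workers share this memory instead of loading their own copies

library(doParallel)
registerDoParallel(cores = 14)  # forked workers on Unix

pieces <- foreach(i = seq_len(lenp)) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- tm::tm_map(piece, tm::removeNumbers)  # namespaces already loaded, so no per-worker load
  saveRDS(piece, paste0(tmpfile, i, ".rds"))
  1
}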

HenrikB
  • Thank you for answering, this is new information for me. Actually, the reason I prepended qdap:: and tm:: is that, for some reason, on my Linux-hosted RStudio, whenever I call a package function I get an error unless I prepend every installed package's function call with packagename::functionname(). In this case both qdap and tm are already loaded in the parent session, even though I reference the packages inside the foreach. – Doug Fir Aug 27 '17 at 05:13