
I'm parallelizing a loop that creates a relatively large dataset at each iteration. I'm using foreach::foreach() with the doParallel backend. When I use foreach the standard way, my RAM usage blows up well before the loop is done. I would therefore like each iteration of foreach to save the created dataset to a file on disk and drop it from memory right after; essentially, each iteration should have only a side effect. I've tried the following, where .combine = c and the NULL return value make foreach return just NULL at the end:

library(tidyverse)
library(foreach)
library(doParallel)

# parallel computation setup
numCores <- detectCores(logical = FALSE)
registerDoParallel(numCores)

some_big_number <- 10

# foreach loop
foreach(i = 1:10, .combine = c) %dopar% {
  x <- rep(1, some_big_number) %>% enframe()  # task that creates large object
  filename <- paste0('X', i, '.csv')
  write_csv(x, filename)
  NULL
}

However, all the data created still seems to be stacked into memory while the loop is running, and my RAM still blows up. How can I achieve the desired behavior?

Ben
  • Are you sure the memory blowing up is because of memory lingering after each iteration is done? If you run multiple memory-intensive operations in parallel, you're going to get a lot of memory use because each iteration uses memory simultaneously with the others; it's not clear to me why you think this is something more than that. – Marius Jan 30 '20 at 06:17
  • Returning `NULL` is the right way to do what you want. You can try adding a `gc()` call before returning, but I'm not sure it would help. – F. Privé Jan 30 '20 at 06:50
  • How large are these large objects? Have you determined their size, multiplied it by `numCores`, and compared with RAM usage? – Roland Jan 30 '20 at 07:03
  • @Marius and @Roland, you were right! The memory was blowing up even for small numbers of iterations, which showed that the problem was caused by each individual iteration, not by results accumulating in memory. I managed to change my code to make each iteration less memory-intensive and it now runs smoothly (using about 30–40% of 32 GB of RAM throughout the process). Thanks!! – Ben Jan 30 '20 at 19:32
  • @Ben FYI, since your code doesn't rely on the order of execution, you can speed things up by adding `.inorder = FALSE`, which makes `foreach` start a new parallel task every time another one finishes (as opposed to waiting for a whole group of `numCores` tasks to finish). – baibo May 02 '20 at 10:35
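The pattern the comments converge on can be sketched with base R's `parallel` package in place of foreach: each task writes its chunk to disk, frees it, and returns NULL, while a small cluster bounds how many large objects are alive at once. This is a minimal illustrative sketch, not the asker's code; the helper name `write_chunk`, the output directory, and the sizes are assumptions:

```r
library(parallel)

# Each task writes its data to disk and returns NULL, so the master
# process never accumulates the large objects in the result list.
write_chunk <- function(i, out_dir, n) {
  x <- data.frame(name = seq_len(n), value = rep(1, n))
  write.csv(x, file.path(out_dir, paste0("X", i, ".csv")), row.names = FALSE)
  rm(x); gc()  # drop the object and encourage R to release the memory
  NULL         # only the side effect (the file on disk) survives
}

out_dir <- tempdir()
cl <- makeCluster(2)  # fewer workers bounds peak memory: at most 2 chunks live at once
res <- parLapply(cl, 1:10, write_chunk, out_dir = out_dir, n = 10)
stopCluster(cl)
```

With foreach, the equivalent levers are the NULL return plus registering fewer cores and, as suggested in the comments, `.inorder = FALSE`.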

0 Answers