
I have a large object (~50GB) that I wish to break into roughly 10 non-disjoint subsets to perform analysis on concurrently. The issue is that I have had numerous bad experiences with parallelism in R whereby large memory objects from the environment are duplicated in each worker, blowing out the memory usage.

Supposing the subsets `data[subset, ]` are only ~5GB each in size, what method could I use to perform concurrent processing with properly controlled memory usage?

One solution I found was jobs::jobs(), which allows me to explicitly specify the objects exported. Unfortunately this works by starting RStudio jobs, and I have no way of programmatically monitoring their completion. I want to aggregate all of the analyses once they are complete, and currently I would have to watch the job monitor in RStudio to check that every job has finished before I run the next code block.

Ideally I would like to use something like futures, but with a tightly controlled memory footprint. From my understanding, futures will capture any global variables they find (including the very large full object) and send them off to each worker. However, if I am able to control the exported data, then the promises will allow my program to continue automatically once the heavy concurrent work is done.

So I am looking for a package, code pattern or other solution that would allow me to:

  1. Execute each iteration of

for (subset in subset_list) {
    data_subset <- data[subset, ]
    run_analysis(data_subset)
}

asynchronously (preferably through processes).

  2. Wait for the completion of all analysis workers and then continue on with the script.

  3. Strictly control the objects exported into the worker processes.
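For concreteness, something along the lines of the sketch below is what I have in mind. This is only a sketch based on my reading of the future documentation: I am assuming the `globals` argument of `future::future()` can be used to restrict the export to just the current subset and the analysis function.

library(future)

plan(multisession, workers = 3)

results <- vector("list", length(subset_list))

for (i in seq_along(subset_list)) {
    data_subset <- data[subset_list[[i]], ]
    # Export only the current subset and the analysis function to the worker,
    # instead of letting the automatic global search decide what to ship.
    results[[i]] <- future(
        run_analysis(data_subset),
        globals = list(data_subset = data_subset, run_analysis = run_analysis)
    )
    rm(data_subset)
}

# Block until every worker has finished, then collect the results for aggregation.
results <- lapply(results, value)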

Following Henrik's comment, is it correct to believe that the following creates duplicates of `data` and `data_split` within the workers? If so, is there a strategy to avoid such duplication?

library(furrr)

plan(multisession, workers = 3)

# ~1.6 GB numeric vector (2e8 doubles, 8 bytes each)
data <- runif(2e8)

# Split into 4 roughly equal chunks
data_split <- split(data, sample(1:4, 2e8, replace = TRUE))

process <- function(x) {
    Sys.sleep(60)
    exp(x)
}

res <- future_map(data_split, process)
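(For reference, the variant I am considering looks roughly like the sketch below; it leans on `furrr_options(globals = ...)` to disable the automatic global search and removes the full vector before mapping. I have not yet profiled whether this actually keeps the workers' memory down.)

library(furrr)

plan(multisession, workers = 3)

data <- runif(2e8)
data_split <- split(data, sample(1:4, 2e8, replace = TRUE))

# Drop the full vector so it cannot be picked up and shipped to the workers.
rm(data)

process <- function(x) {
    Sys.sleep(60)
    exp(x)
}

# Disable the automatic search for globals in the calling environment;
# `process` and the chunks of `data_split` are supplied to future_map() directly.
res <- future_map(data_split, process, .options = furrr_options(globals = FALSE))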
  • "Be able to execute run_analysis(data[subset, ] ) asynchronously (preferably through processes)." No that is not the way because then `data` must be copied to all workers. You want something like `data <- split(data, subset); parallel:::parLapply(cl, data, run_analysis)`. I would need to check to confirm that this only copies necessary list elements to the workers. – Roland Aug 01 '23 at 06:07
  • Yes I do understand that, I just abbreviated my code for brevity. I'll edit the question to make this clear. – shians Aug 01 '23 at 06:57
  • (Futureverse author here): "from my understanding futures will capture any global variables it finds (including my very large full subset) and send it off to each worker" is not really correct, but I can see how you might have come to that conclusion depending on what you're doing. It's hard to give more advice without a reproducible example. I suggest that you create a minimal reproducible example that causes you to come to this conclusion, and then we can help you from there. Also, note that the different "future" functions have optional arguments for specifying exactly what is exported. – HenrikB Aug 01 '23 at 07:00
  • @HenrikB thanks for replying. I would need some time to figure out how to create a reproducible example, blowing out memory is not fun. Would you mind in the meantime providing some rule-of-thumb recommendations to avoid capture of large variables? I know that I've used the furrr package before without blowing out the memory, but without explicit control of the exports or clear idea of how to manipulate exports I'm reluctant to work it into a workflow. – shians Aug 01 '23 at 07:07
  • If you create a function, e.g. in a script, and pass that to the worker, then that function will carry all the environment where it was created, cf. `ls.str(environment(my_fcn))`. That is how R works and is nothing specific to parallel processing, but you do notice the extra payload when parallelizing. Other than that, read the docs and create a reproducible example. – HenrikB Aug 01 '23 at 08:35
  • @HenrikB if I am understanding correctly, does this mean that each worker would be sent a full copy of `data` and `data_split` in the example I've edited into the question? – shians Aug 01 '23 at 09:26
  • @HenrikB Do you believe that a combination of `carrier::crate()` and `future::future()` is a robust solution to this pattern? – shians Aug 02 '23 at 03:45
  • Each of your workers should get (read or create) its own data chunk. It avoids duplication in memory and can add parallelism to reading. – George Ostrouchov Aug 09 '23 at 20:36
  • @GeorgeOstrouchov Having to serialise data chunks to read back in by the workers might negate some of the performance I hoped to gain through parallel processing. Still, it is a potential technique I can try, thanks! – shians Aug 10 '23 at 07:01
  • Yes, if the results are not reductions, there is a cost to get them back in a serial session. But maybe you can do all the processing with the data split and only reduced quantities come back? What platform are you using? – George Ostrouchov Aug 10 '23 at 14:34
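A minimal sketch of the split-then-parLapply pattern suggested in Roland's comment, reusing `subset_list` and `run_analysis` from the question; whether only the relevant chunk is copied to each worker would still need to be confirmed with memory profiling:

library(parallel)

cl <- makeCluster(3)

# Materialise the (possibly overlapping) subsets once in the main session,
# then drop the full object so it is not around to be serialised.
data_chunks <- lapply(subset_list, function(idx) data[idx, ])
rm(data)

# parLapply() distributes the elements of data_chunks across the workers,
# so each task receives its chunk(s) plus run_analysis.
res <- parLapply(cl, data_chunks, run_analysis)

stopCluster(cl)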

0 Answers