I have a large object (~50GB) that I wish to break into roughly 10 non-disjoint subsets to perform analysis on concurrently. The issue is that I have had numerous bad experiences with parallelism in R whereby large memory objects from the environment are duplicated in each worker, blowing out the memory usage.
Suppose the subsets `data[subset, ]` are only ~5GB each in size. What method could I use to perform concurrent processing with properly controlled memory usage?
One solution I found was `jobs::jobs()`, which allows me to explicitly specify the objects exported. Unfortunately, this works by starting RStudio Jobs, which I have no way of programmatically monitoring for completion. I want to aggregate all the analyses once they have finished, but currently I have to watch the job monitor in RStudio to check that all jobs are done before running the next code block.
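For illustration, this is roughly the kind of programmatic monitoring I'm after, sketched with the `callr` package: `r_bg()` starts a fresh R session and sends it only the objects passed via `args`, so nothing else from my environment is serialized to the worker (`run_analysis` and `data_subset` here are hypothetical stand-ins):

```r
library(callr)

run_analysis <- function(d) sum(d)  # stand-in for the real analysis
data_subset <- runif(10)            # stand-in for one subset

# Only the function and `data_subset` travel to the worker process.
p <- r_bg(run_analysis, args = list(data_subset))

p$wait()               # block until the worker finishes
res <- p$get_result()  # collect its result
```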
Ideally I would like to use something like the `future` package, but with a tightly controlled memory footprint. From my understanding, `future` will capture any global variables it finds (including my very large full dataset) and send them to each worker. However, if I can control the exported data, then the promises will allow my program to continue automatically once the heavy concurrent work is done.
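From the documentation, `future()` itself accepts a `globals` argument that lets me whitelist exactly what gets exported; a minimal sketch (the object names are mine):

```r
library(future)
plan(multisession, workers = 2)

big_data <- runif(1e4)                # stand-in for the large object
idx <- sample(length(big_data), 100)
chunk <- big_data[idx]                # subset in the parent process

# Whitelist the export: only `chunk` is serialized, not `big_data`.
f <- future(sum(chunk), globals = list(chunk = chunk))
v <- value(f)

plan(sequential)                      # shut the workers down
```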
So I am looking for a package, code pattern, or other solution that would allow me to:
- Be able to execute each iteration of

  ```r
  for (subset in subset_list) {
    data_subset <- data[subset, ]
    run_analysis(data_subset)
  }
  ```

  asynchronously (preferably through processes).
- Be able to wait for the completion of all analysis workers and continue on with a script.
- Strictly control the objects exported into the worker processes.
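To make these requirements concrete, this is roughly the pattern I have in mind, sketched with `future`; the explicit `globals` list is the kind of control I want, and `data`, `subset_list`, and `run_analysis` are placeholders:

```r
library(future)
plan(multisession, workers = 3)

data <- matrix(runif(1e4), nrow = 100)   # placeholder for the big object
subset_list <- list(1:60, 40:100)        # overlapping row subsets
run_analysis <- function(d) colMeans(d)  # placeholder analysis

futs <- lapply(subset_list, function(subset) {
  data_subset <- data[subset, ]
  # Export only the subset and the function, not `data` itself:
  future(run_analysis(data_subset),
         globals = list(data_subset = data_subset,
                        run_analysis = run_analysis))
})

results <- value(futs)  # blocks until every worker has finished
plan(sequential)
```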
Following Henrik's comment, is it correct to believe that the following creates duplicates of `data` and `data_split` within the workers? If so, is there a strategy to avoid such duplication?
```r
library(furrr)
plan(multisession, workers = 3)

data <- runif(2e8)                                    # ~1.6 GB numeric vector
data_split <- split(data, sample(1:4, 2e8, replace = TRUE))

process <- function(x) {
  Sys.sleep(60)
  exp(x)
}

res <- future_map(data_split, process)
```
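As a smaller experiment, my understanding (not verified) is that `furrr_options(globals = FALSE)` restricts the exports to just the mapped chunks of `.x` and the function itself, which would at least rule out `data` being shipped to the workers:

```r
library(furrr)
plan(multisession, workers = 2)

small <- runif(1e4)                                   # scaled-down data
small_split <- split(small, sample(1:4, 1e4, replace = TRUE))

process <- function(x) exp(x)

# globals = FALSE: only the chunks of small_split and `process` are sent.
res <- future_map(small_split, process,
                  .options = furrr_options(globals = FALSE))

plan(sequential)
```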