I have access to a large computing cluster with many nodes, each of which has >16 cores, running Slurm 20.11.3. I want to run a job in parallel using furrr::future_pmap(). I can parallelize across multiple cores on a single node, but I have not been able to figure out the correct syntax to take advantage of cores on multiple nodes. See this related question.
Here is a reproducible example where I made a function that sleeps for a given number of seconds and returns the starting time, the ending time, and the node name.
library(furrr)

# Set up parallel processing: my attempt at a nested plan,
# four levels of 16 multicore workers each
options(mc.cores = 64)
plan(
  list(tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16),
       tweak(multicore, workers = 16))
)
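In case it helps diagnose the problem, the plan that actually gets registered can be inspected with future's plan() and nbrOfWorkers(); this check is my addition, not part of the original script:

plan()          # prints the full plan stack that was registered
nbrOfWorkers()  # number of workers the top-level layer will use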
fake_fn <- function(x) {
  t1 <- Sys.time()
  Sys.sleep(x)
  t2 <- Sys.time()
  hn <- system2('hostname', stdout = TRUE)
  data.frame(start = t1, end = t2, hostname = hn)
}
stuff <- data.frame(x = rep(5, 64))
output <- future_pmap_dfr(stuff, function(x) fake_fn(x))
I requested the allocation with salloc --nodes=4 --ntasks=64 and ran the above R script interactively within it.
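Inside the salloc session, future can also report what it sees of the allocation. As I understand it, availableWorkers() parses Slurm environment variables such as SLURM_JOB_NODELIST, so it should return 64 entries spread across the four hostnames; this diagnostic is my addition, not part of the original script:

library(future)
availableWorkers()  # expect 64 entries across the 4 allocated nodes
availableCores()    # counts cores on the current node only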
The script runs in about 20 seconds and returns the same hostname for all rows. Since 64 five-second tasks completed in roughly four 5-second waves, it is running 16 iterations simultaneously on one node, not 64 iterations simultaneously split across 4 nodes as intended. How should I change the plan() syntax so that I can take advantage of the multiple nodes?
Edit: I also tried a couple of other things:

- I replaced multicore with multisession, but saw no difference in output.
- I replaced the plan(list(...)) with plan(cluster(workers = availableWorkers())), but it just hangs; a verbose variant that might help narrow down the hang is sketched below.
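For reference, here is a sketch of how that hanging cluster setup could be made easier to diagnose, by building the PSOCK cluster manually with verbose logging. makeClusterPSOCK() is from the parallelly package; this verbose variant is an assumption on my part, not something I have confirmed fixes anything:

library(future)
# Build the workers explicitly so each connection attempt is logged;
# availableWorkers() should return one entry per Slurm task.
cl <- parallelly::makeClusterPSOCK(availableWorkers(), verbose = TRUE)
plan(cluster, workers = cl)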