
I have the following code that does serial processing with purrr::pmap():


library(tidyverse)

set.seed(1)
params <- tribble(
  ~mean, ~sd, ~n,
  5,     1,  1,
  10,     5,  3,
  -3,    10,  5
)
params %>% 
  pmap(rnorm)
#> [[1]]
#> [1] 4.373546
#> 
#> [[2]]
#> [1] 10.918217  5.821857 17.976404
#> 
#> [[3]]
#> [1]   0.2950777 -11.2046838   1.8742905   4.3832471   2.7578135

How can I parallelize (fork) the process above so that it runs faster and produces an identical result?

Here, I use rnorm for illustration purposes; in reality I have a function that does heavy-duty work, and it needs parallelizing.

I'm open to a non-purrr (non-tidyverse) solution, as long as it produces an identical result given the rnorm function and params as input.

  • Is your actual function deterministic? If not and reproducibility is required for randomness too, I see no way. The results in each iteration depend on RNG state after previous iteration. Unless there's a way to know RNG state in advance (as seems to be the case in your example), and hacking around `.Random.seed`... but it's going to be ugly – Aurèle Nov 29 '17 at 14:08
  • See chapter 6 of https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf – Aurèle Nov 29 '17 at 14:20
  • Can you provide a more realistic example? Is the number of parameters (`nrow(params)`) high or the number of simulations `n`? – F. Privé Nov 29 '17 at 18:12
  • @Aurèle: my actual function is deterministic. So in my OP the exact reproducibility is not strictly required. – neversaint Nov 30 '17 at 00:02
  • You can always pass the seed value as an additional parameter, and `set.seed(...)` on each worker (a minimal sketch of this idea follows these comments) – CPak Nov 30 '17 at 18:06
  • Sure, I meant reproduce the sequential behavior in parallel. – Aurèle Dec 01 '17 at 07:50
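
Building on CPak's point above, here is a minimal sketch of that idea. The seed column and the seeded_rnorm() wrapper are hypothetical (not from the thread), and it uses forking, so it is not for Windows:

library(parallel)

# Hypothetical: carry one known seed per row of params
params_seeded <- dplyr::mutate(params, seed = dplyr::row_number())

seeded_rnorm <- function(n, mean, sd, seed) {
  set.seed(seed)               # each task resets the RNG to a known state
  rnorm(n, mean, sd)
}

# Serial and forked runs now agree with each other (though not with the
# original sequential pmap(rnorm) output, which used a single RNG stream)
serial <- purrr::pmap(params_seeded, seeded_rnorm)
forked <- mcmapply(seeded_rnorm,
                   n = params_seeded$n, mean = params_seeded$mean,
                   sd = params_seeded$sd, seed = params_seeded$seed,
                   SIMPLIFY = FALSE, mc.cores = 2)
identical(serial, forked)      # should be TRUE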

1 Answer


In short: a "parallel pmap()" with a syntax similar to pmap() could look like lift(mcmapply)() or lift(clusterMap)().
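
(For reference: purrr::lift() turns a function of ... into a function of a single list, so feeding it a data frame splices the columns in as named arguments; note that lift() has since been deprecated in purrr 1.0.0. Roughly, and only as an approximation of what the lifted call does:)

# lift(parallel::mcmapply)(params, FUN = rnorm) is roughly equivalent to
do.call(parallel::mcmapply, c(as.list(params), list(FUN = rnorm)))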


If you're not on Windows, you could:

library(parallel)

# forking

set.seed(1, "L'Ecuyer")
params %>% 
  lift(mcmapply, mc.cores = detectCores() - 1)(FUN = rnorm)

# [[1]]
# [1] 4.514604
# 
# [[2]]
# [1] 0.7022156 0.8734875 5.0250478
# 
# [[3]]
# [1]   8.7704060  11.7217925 -12.8776289 -10.7466152   0.5177089

Edit

Here is a "cleaner" option that should feel more like using pmap():

nc <- max(parallel::detectCores() - 1, 1L)

par_pmap <- function(.l, .f, ..., mc.cores = getOption("mc.cores", 2L)) {
  do.call(
    parallel::mcmapply, 
    c(.l, list(FUN = .f, MoreArgs = list(...), SIMPLIFY = FALSE, mc.cores = mc.cores))
  )
}

f <- function(n, mean, sd, ...) rnorm(n, mean, sd) 

params %>% 
  par_pmap(f, some_other_arg_to_f = "foo", mc.cores = nc)

If you're on Windows (or any other OS), you could:

library(parallel)

# (Parallel SOCKet cluster)

cl <- makeCluster(detectCores() - 1)

clusterSetRNGStream(cl, 1)
params %>% 
  lift(clusterMap, cl = cl)(fun = rnorm)

# [[1]]
# [1] 5.460811
# 
# [[2]]
# [1] 7.573021 6.870994 5.633097
# 
# [[3]]
# [1] -21.595569 -21.253025 -12.949904  -4.817278  -7.650049

stopCluster(cl)
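
If you want the par_pmap() feel on a PSOCK cluster as well, a similar wrapper is possible. A minimal sketch, reusing f from above (the par_pmap_cl() name and its cl argument are mine, not part of the original answer):

par_pmap_cl <- function(.l, .f, ..., cl) {
  do.call(
    parallel::clusterMap,
    c(list(cl = cl, fun = .f, MoreArgs = list(...), SIMPLIFY = FALSE), .l)
  )
}

cl <- makeCluster(detectCores() - 1)
clusterSetRNGStream(cl, 1)
params %>% 
  par_pmap_cl(f, some_other_arg_to_f = "foo", cl = cl)
stopCluster(cl)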

In case you're more inclined to use foreach, you could:

library(doParallel)

# (forks by default on my Linux machine; should be PSOCK by default on Windows)

registerDoParallel(cores = detectCores() - 1)

set.seed(1, "L'Ecuyer")
lift(foreach)(params) %dopar%
  rnorm(n, mean, sd)

# [[1]]
# [1] 4.514604
# 
# [[2]]
# [1] 0.7022156 0.8734875 5.0250478
# 
# [[3]]
# [1]   8.7704060  11.7217925 -12.8776289 -10.7466152   0.5177089

stopImplicitCluster()
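
(Here lift(foreach)(params) simply splices the columns of params into foreach() as named iteration variables, which is why n, mean and sd are visible inside the %dopar% expression. Written out by hand, the call above is roughly:)

# same backend registration as above
foreach(mean = params$mean, sd = params$sd, n = params$n) %dopar%
  rnorm(n, mean, sd)
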
Aurèle
  • Thanks so much. I have another issue related to your approach; would you mind looking at it? https://stackoverflow.com/questions/47625279/how-to-preserve-the-list-of-data-frame-form-after-using-parallel-apply – neversaint Dec 04 '17 at 01:33