Results of parallelization with snowfall library not reproducible?

Question

Each time I run the following code, the numbers in the vector result_seq remain the same, since I have used set.seed(11) before generating the vector.

However, it seems that even though I use set.seed(11) again before I generate the numbers in result_par, the numbers change every time I run the code.

library(snowfall)
snowfall::sfInit(parallel = TRUE, cpus = 4)

testFun = function(i) {
  result <- rnorm(1,10,3)
}

nsim <- 10

set.seed(11)
result_seq <- sapply(1:nsim, testFun)
print(mean(result_seq))

set.seed(11)
result_par <- sfLapply(1:nsim, testFun)
print(mean(as.numeric(result_par)))

Why is this happening? What can I do to ensure that the random numbers generated during the snowfall parallelization are reproducible?

score 1 · Answer 1 · answered Mar 01 '21 at 17:29

R is single-threaded, so parallelizing code with snowfall means spinning up multiple R sessions. Here sfInit() started 4 separate "child" sessions, and sfLapply() sends the work to them, while set.seed() only runs once, in your "parent" session. Each "child" session has its own RNG state, is not aware of the others, and never sees the seed you set in the parent.
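The effect can be seen in a minimal sketch. It uses base R's parallel package rather than snowfall (an assumption made so the example needs no extra packages), and the names `draw` and `run1` to `run4` are illustrative only. Seeding the parent leaves the workers' RNG state untouched, whereas `clusterSetRNGStream()` seeds the workers themselves:

```r
library(parallel)

cl <- makeCluster(2)  # two worker sessions, each with its own RNG state

draw <- function(i) rnorm(1, 10, 3)

# Seeding only the parent session: the workers never see this seed,
# so the two runs draw from wherever each worker's RNG happens to be.
set.seed(11)
run1 <- unlist(parLapply(cl, 1:10, draw))
set.seed(11)
run2 <- unlist(parLapply(cl, 1:10, draw))
print(identical(run1, run2))

# Seeding the workers: each worker gets its own L'Ecuyer-CMRG stream
# derived from iseed, so resetting it reproduces the draws.
clusterSetRNGStream(cl, iseed = 11)
run3 <- unlist(parLapply(cl, 1:10, draw))
clusterSetRNGStream(cl, iseed = 11)
run4 <- unlist(parLapply(cl, 1:10, draw))
print(identical(run3, run4))

stopCluster(cl)
```

The second pair of runs should match only as long as the number of workers stays the same, because `parLapply()` splits the tasks across the workers in fixed chunks.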

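You can move set.seed() into testFun() to solve this. Derive the seed from the task index i, because a fixed seed such as set.seed(11) would make every call return the same number:

testFun = function(i) {
  # seed from the task index so each task gets its own reproducible draw
  set.seed(11 + i)
  result <- rnorm(1,10,3)
}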

sfExport() might be worth exploring, as it is designed to distribute variables to the "child" sessions for contexts like this.
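One way sfExport() could be used here, as a sketch rather than a definitive recipe: keep a base seed in the parent, export it to the workers, and combine it with the task index inside the function. The variable name `base_seed` and the use of 2 CPUs are assumptions for illustration.

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 2)

base_seed <- 11  # hypothetical variable; lives in the parent session only

testFun <- function(i) {
  # base_seed is looked up on the worker, so it must be exported first
  set.seed(base_seed + i)
  rnorm(1, 10, 3)
}

# Copy base_seed from the parent into every worker session
sfExport("base_seed")

result_a <- as.numeric(sfLapply(1:10, testFun))
result_b <- as.numeric(sfLapply(1:10, testFun))
print(identical(result_a, result_b))

# The same function run sequentially in the parent gives the same numbers
result_seq <- sapply(1:10, testFun)
print(identical(result_a, result_seq))

sfStop()
```

Seeding each task with consecutive integers makes the results reproducible, but it does not guarantee that the per-task streams are statistically independent; a parallel RNG with dedicated substreams is the more careful option.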

Nate

If you're willing to switch your parallelization front-end, the **future** framework guarantees numerically reproducible random numbers regardless of how you parallelize, how many parallel workers you use, or whether you run sequentially, e.g. `result_par <- future.apply::future_lapply(1:nsim, testFun, future.seed = TRUE)`. – HenrikB Mar 01 '21 at 19:06
But if the seeding only happens in the parent session and the parent then spawns the child sessions, shouldn't the numbers generated be the same on each run, since the parent is always seeded with set.seed(11)? It seems like the child processes act as if set.seed(11) had not been called at all. – sonicboom Mar 02 '21 at 10:30
New sessions won't inherit the state of the "parent" (maybe that was a bad analogy on my part); it's not like object-oriented inheritance. But you can set up the new/child sessions to replicate the original/parent environment. The arguments to `future.apply::future_lapply()` (see `?future.apply::future_lapply`) give you fine-grained control over what to carry over into the new sessions. – Nate Mar 02 '21 at 13:34
My rule of thumb: use `set.seed()` only at the top of your script, if at all. If you find yourself setting it elsewhere, it suggests you're doing something ad hoc, and there's a risk it'll come back and bite you later, e.g. when you've forgotten about it and it resets an RNG stream you rely on elsewhere. – HenrikB Mar 02 '21 at 17:25
@sonicboom, if you're asking about `future.seed = TRUE` of **future.apply**, then the answer is: that argument will produce statistically sound RNG substreams based on the current RNG state of the parent process (=your main R session). For more info, see for instance https://www.jottr.org/2017/02/19/future-rng/ and https://www.jottr.org/2020/09/22/push-for-statical-sound-rng/. – HenrikB Mar 03 '21 at 03:31
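A minimal sketch of the `future.seed = TRUE` approach described in the comments above, assuming the future and future.apply packages are installed. The choice of 2 workers and the names `run_par1`, `run_par2` and `run_seq` are illustrative. Because the substreams are derived from the parent's RNG state, calling `set.seed(11)` in the parent before each run should reproduce the results, whichever backend is active:

```r
library(future)
library(future.apply)

testFun <- function(i) rnorm(1, 10, 3)

# Parallel backend: separate background R sessions
plan(multisession, workers = 2)

set.seed(11)
run_par1 <- unlist(future_lapply(1:10, testFun, future.seed = TRUE))
set.seed(11)
run_par2 <- unlist(future_lapply(1:10, testFun, future.seed = TRUE))
print(identical(run_par1, run_par2))

# Sequential backend: same seed, same per-element RNG substreams
plan(sequential)

set.seed(11)
run_seq <- unlist(future_lapply(1:10, testFun, future.seed = TRUE))
print(identical(run_par1, run_seq))
```

Note that these numbers are not expected to match a plain `sapply(1:10, testFun)` after `set.seed(11)`, since `future.seed = TRUE` draws from L'Ecuyer-CMRG substreams rather than from R's default generator.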
