I have many tasks to run; each can take up to 20 minutes and uses 100% of a CPU core. I am new to multiprocessing, and I decided to use joblib since it seems to let me multiprocess without threading (I have 12 cores and would like to run 12 processes at a time, starting new ones as the old ones finish, and I could not get this to work with Pool or mp.Process).
I am running Python 2.7 and have recreated a simple version of what is happening.
from joblib import Parallel, delayed
import numpy as np
from time import sleep

def do_something():
    print np.random.choice([0, 1])
    sleep(3)

if __name__ == '__main__':
    Parallel(n_jobs=3, backend='multiprocessing')(delayed(do_something)() for n in xrange(30))
The output always comes in sets of three, either '1 1 1' or '0 0 0', as if the number were generated once and then shared by all three workers. I thought that joblib.Parallel would simply call the function 30 separate times and use 3 cores to do so.
Is there a way to make it so that a new number is generated each time do_something() is called?
** edit: It turns out this is how the random generator behaves under multiprocessing: NumPy's global generator is seeded once in the parent process, and the forked worker processes each inherit a copy of that state, so parallel calls all draw the same sequence of numbers. Since I know how many times the function will be called in my real code, I solved this by generating a list of random numbers beforehand in the parent process and pulling from that list in each call.