
I'm learning multiprocessing in Python 3.6 and here is what I am trying to do. I have two arrays of 10 million records each, and a function:

import multiprocessing
import os
import time

import numpy as np

# Two arrays of 10 million random integers each
arrays = []
arr1 = np.random.randint(1, 100, 10000000)
arr2 = np.random.randint(1, 100, 10000000)
arrays.append(arr1)
arrays.append(arr2)

# Function each process will run: take the log of every element
def execFunc(arr):
    for i in arr:
        np.log(i)

I have two cores, and knowing that, I do the following:

t0 = time.time()
procs = []
# Start one process per core, each handling one of the arrays
for i in range(os.cpu_count()):
    proc = multiprocessing.Process(target=execFunc, args=(arrays[i],))
    procs.append(proc)
    proc.start()

# Wait for all processes to finish
for p in procs:
    p.join()
t1 = time.time()
print('time spent: ', t1 - t0)

The above results in BrokenPipeError: [Errno 32] Broken pipe, here:

 58 def dump(obj, file, protocol=None):
 59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)

Can someone please explain where I messed up?

  • Does it run locally, i.e. without multiprocessing? Otherwise, does it run for much smaller inputs - say 100 records? I suspect the problem is that your inputs are too big to pickle, or perhaps unpickle. As an aside - you might want to use multiprocessing.Pool instead. – Bennet Dec 02 '17 at 19:10
  • @Bennet it really does run on 100 records. I thought I might have done something silly there. It stops working at 10,000. Though, if I run it within a single process (I guess that's locally) it works fine with any number. I haven't reached Pool yet, but thanks for the piece of advice, I will give it a go later on. – Vlad Dec 02 '17 at 19:24
  • @Bennet is there any way to avoid the pickle problem with large volumes? – Vlad Dec 02 '17 at 19:26
  • What's in the variable `args[i]`? I ask because `multiprocessing.Process()` will need to be able to pickle/unpickle it. – martineau Dec 02 '17 at 19:42
  • @martineau my bad, sorry for the confusion. It should have been `arrays[i]`. It's a list with two np arrays in it. I've edited the post and changed args[i] to arrays[i]. – Vlad Dec 02 '17 at 19:55
  • I'm no `numpy` expert, but I believe `array`s can be pickled, so that's not likely the issue (assuming `ForkingPickler` can also handle them). Can we assume you also meant `target = execFun,` not `target = exec,`? – martineau Dec 02 '17 at 20:02
  • @martineau yes, you are absolutely right. It's `execFunc`. I've edited the post. I've tried a Python `list` as well but it results in the same problem. As @Bennet pointed out, if I try 100 elements in each array it works just fine. But once I raise each array size to 10,000 it spits out the same error. – Vlad Dec 02 '17 at 20:34
  • There is a hack... I think the memory limit is per _element_ - so you can easily pass say a list of 100 arrays, each being 1/100th of the full size (if a factor of 100 does it). Or, if you are loading this data from somewhere, you can pass instructions on how to load the data, instead of the data itself. Either way, loading the data in the parent process, pickling and then unpickling will waste a lot of memory. – Bennet Dec 03 '17 at 20:31
  • @Bennet yes, I think so as well. I have only two cores, and splitting the work into (1,000,000 / 10,000) * 2 processes results in 20, which is way too many for two CPUs. Anyway, thanks a lot for pointing out the pickling issue, otherwise I wouldn't have guessed it. Will keep my focus on Pool as you suggested (a rough sketch of that approach is below). – Vlad Dec 03 '17 at 21:57
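
Following up on the comments above, here is a minimal sketch of the `multiprocessing.Pool` approach Bennet suggests, combined with splitting each array into smaller chunks so that no single pickled argument is too large. The chunk size of 10,000 and the reuse of `execFunc` are illustrative assumptions only, not tested limits:

import multiprocessing

import numpy as np

def execFunc(arr):
    # Same work as in the question: take the log of every element
    for i in arr:
        np.log(i)

if __name__ == '__main__':
    arr1 = np.random.randint(1, 100, 10000000)
    arr2 = np.random.randint(1, 100, 10000000)

    # Split each big array into roughly 10,000-element chunks so each
    # pickled task stays small (10,000 is only an illustrative chunk size)
    chunks = []
    for arr in (arr1, arr2):
        chunks.extend(np.array_split(arr, len(arr) // 10000))

    # The Pool keeps one worker process per core and feeds it one chunk at a time
    with multiprocessing.Pool() as pool:
        pool.map(execFunc, chunks)

As also noted in the comments, an alternative is to pass each worker instructions for generating or loading its share of the data rather than the data itself, so nothing large has to be pickled at all.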

0 Answers