I am trying to write a large amount of data to a numpy memmap, and I want to speed it up using multiprocessing. Here is a minimal example of what I'm trying to do.
import numpy as np

# input and output must be separate files; opening the same file twice
# with mode='w+' would truncate it and destroy the source data
unProcessedData = np.memmap('input.memmap', dtype=np.uint16, mode='r', shape=(2500000, 512))
numpyMemmap = np.memmap('file.memmap', dtype=np.uint16, mode='w+', shape=(2500000, 512))

for i in range(2500000):
    numpyMemmap[i] = unProcessedData[i] + 1
I've been trying to find the best method to parallelize this. I've heard you have to be careful, since spawning an additional process carries more overhead than spawning a new thread, or than just staying single-threaded and using an event loop to trigger actions.
Since I'm writing to a memmap backed by disk, I've heard that multithreading may be better than multiprocessing here, but I can't find a definitive answer.
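Here is a rough sketch of what I imagine the threaded version would look like; this is only my guess, with 'input.memmap' standing in for wherever the source data actually lives and the thread count chosen arbitrarily. My understanding is that NumPy releases the GIL during the bulk arithmetic, so threads writing to disjoint slices could overlap:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def add_one(src, dst, start, stop):
    # each thread writes only to its own disjoint row range, so no locking is needed
    dst[start:stop] = src[start:stop] + 1

n_rows = 2500000
src = np.memmap('input.memmap', dtype=np.uint16, mode='r', shape=(n_rows, 512))
dst = np.memmap('file.memmap', dtype=np.uint16, mode='w+', shape=(n_rows, 512))

n_threads = 8  # arbitrary; I'd tune this
bounds = np.linspace(0, n_rows, n_threads + 1, dtype=int)

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    for i in range(n_threads):
        pool.submit(add_one, src, dst, bounds[i], bounds[i + 1])
# the with-block waits for all threads to finish before we flush to disk
dst.flush()

Is something like this the right direction, or does the threading overhead still dominate?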
It seems that joblib may be particularly well suited for this:
https://joblib.readthedocs.io/en/latest/auto_examples/parallel_memmap.html
But I can't find an exact example for what I'm trying to do.
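For reference, here is my best guess at what the joblib version would look like, adapted from that page; 'input.memmap' and the number of chunks are placeholders, and I'm not sure this is the intended usage. As I understand it, joblib's Parallel can hand np.memmap instances to worker processes without copying the underlying data:

import numpy as np
from joblib import Parallel, delayed

def process_chunk(src, dst, start, stop):
    # workers write to disjoint row ranges, so they never contend
    dst[start:stop] = src[start:stop] + 1

n_rows = 2500000
src = np.memmap('input.memmap', dtype=np.uint16, mode='r', shape=(n_rows, 512))
dst = np.memmap('file.memmap', dtype=np.uint16, mode='w+', shape=(n_rows, 512))

n_jobs = 8  # arbitrary; I'd tune this
bounds = np.linspace(0, n_rows, n_jobs + 1, dtype=int)

Parallel(n_jobs=n_jobs)(
    delayed(process_chunk)(src, dst, bounds[i], bounds[i + 1])
    for i in range(n_jobs)
)
dst.flush()

Is this the right approach for writing to a memmap in parallel, or should I stick with threads (or even a single vectorized pass)?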