
I have a directory containing n h5 files, each of which holds m image stacks to filter. For each image I will run the filtering (Gaussian and Laplacian) using Dask parallel arrays in order to speed up the processing (Ref to Dask). I will use the Dask arrays through the apply_parallel() function in scikit-image.
I will run the processing on a small server with 20 CPUs.
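For reference, here is a minimal sketch of the per-image step I have in mind. The file name, the dataset name "stack" and the sigma value are placeholders, not my actual data layout:

    import h5py
    from scipy import ndimage as ndi
    from skimage.util import apply_parallel

    def filter_image(img, sigma=2.0):
        # Gaussian smoothing followed by a Laplacian-of-Gaussian response
        smoothed = ndi.gaussian_filter(img, sigma=sigma)
        return ndi.gaussian_laplace(smoothed, sigma=sigma)

    with h5py.File("sample.h5", "r") as f:   # placeholder file name
        stack = f["stack"][...]              # (m, y, x) image stack

    filtered = [
        # apply_parallel splits each image into chunks and maps filter_image
        # over them with dask; depth pads the chunk borders so the filters
        # see neighbouring pixels.
        apply_parallel(filter_image, img, chunks=(256, 256), depth=10)
        for img in stack
    ]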

I would like some advice on which parallel strategy makes more sense to use:

1) Process the h5 files sequentially and use all the CPUs for the Dask processing.
2) Process the h5 files in parallel with x cores and use the remaining 20-x cores for the Dask processing.
3) Distribute the resources: process the h5 files in parallel, process the images within each h5 file in parallel, and use the remaining resources for Dask.

thanks for the help!

s1mc0d3

2 Answers


Use make for parallelization.

With make -j20 you can tell make to run 20 processes in parallel.

By using multiple processes, you avoid the cost of the global interpreter lock (GIL). For independent tasks it is more efficient to use multiple independent processes (benchmark if in doubt). Make is great for processing whole folders where you need to apply the same command to each file: it is traditionally used for compiling source code, but it can run arbitrary commands.
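A minimal sketch of what such a Makefile could look like, assuming one filtered h5 file per input file and a hypothetical filter_stack.py script that does the actual work (the recipe line must start with a tab):

    INPUTS  := $(wildcard data/*.h5)
    OUTPUTS := $(INPUTS:.h5=_filtered.h5)

    all: $(OUTPUTS)

    # one rule application per input file; "make -j20" runs up to 20 at once
    %_filtered.h5: %.h5
    	python filter_stack.py $< $@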

Has QUIT--Anony-Mousse
  • I'm not confident that the GIL is a serious barrier here. All relevant libraries listed (numpy, skimage) release the GIL. – MRocklin Feb 10 '16 at 21:05
  • Nevertheless, if the tasks are independent, it almost always pays off to run multiple independent instances (at most, you may want to make one instance per batch). Crash recovery (e.g. if one image fails to decode) and all that is trivial if each file is a separate task. – Has QUIT--Anony-Mousse Feb 10 '16 at 22:25

It is always best to parallelize in the simplest way possible. If you have several files and just want to run the same computation on each of them, then parallelizing across the files is almost certainly the simplest approach. If this saturates your computational resources then you can stop there without diving into more sophisticated methods.

If this is indeed your situation then you can parallelize with dask, make, concurrent.futures, or any of a variety of other libraries.

If there are other concerns, like trying to parallelize the operation itself or making sure you don't run out of memory, then you are forced into more sophisticated systems like dask, but this may not be the case.
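For the file-level approach, a minimal sketch with concurrent.futures, where process_file is a hypothetical function that reads one h5 file, filters its stacks and writes the result:

    import glob
    from concurrent.futures import ProcessPoolExecutor

    def process_file(path):
        # hypothetical: open the h5 file, filter each image stack, save output
        ...
        return path

    if __name__ == "__main__":
        files = glob.glob("data/*.h5")                     # placeholder input path
        with ProcessPoolExecutor(max_workers=20) as pool:  # one worker per CPU
            for done in pool.map(process_file, files):
                print("finished", done)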

MRocklin
  • After quite a bit of testing, the most efficient way is to run Dask for the processing using all cores and process the files sequentially. The next step will be to move away from our in-house server, use Google Dataproc, and run one file per worker. – s1mc0d3 Apr 01 '16 at 09:06