What is the most efficient way to utilize dask multiprocessing scheduler if data flow between tasks is big?

Question

We have a dask compute graph (quite custom, so we use dask.delayed instead of collections). I've read in the docs that the current scheduling policy is LIFO, so a worker process has a good chance of receiving the data it has just computed for the next steps down the graph. But as far as I understand, task results are still serialized to and deserialized from the hard drive even in this case.

So the question is how much performance I would gain by keeping as few tasks as possible along each path of independent computations in the graph:

A) many small "map" tasks along each path

t --> t --> t -->...
                     some reduce stage
t --> t --> t -->...

B) one huge "map" task along each path

   T ->
        some reduce stage
   T ->

Thank you!

MRocklin · Accepted Answer · 2016-12-13T02:07:54.103


The dask multiprocessing scheduler will automatically fuse linear chains of tasks into single tasks, so your case A above will automatically become case B.
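To make the idea of fusing a linear chain concrete, here is a toy sketch in plain Python. It is not dask's implementation; it only assumes a dask-style graph, i.e. a dict mapping string keys to task tuples `(func, *args)`. The helper names `is_task`, `find_deps`, `substitute`, `fuse_linear`, and `execute` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def inc(x):
    return x + 1


def double(x):
    return 2 * x


def add(x, y):
    return x + y


def is_task(x):
    # A task is a tuple whose first element is callable: (func, *args).
    return isinstance(x, tuple) and len(x) > 0 and callable(x[0])


def find_deps(task, graph):
    """Return the set of graph keys referenced anywhere in a (nested) task."""
    if is_task(task):
        found = set()
        for arg in task[1:]:
            found |= find_deps(arg, graph)
        return found
    if isinstance(task, str) and task in graph:
        return {task}
    return set()


def substitute(task, key, replacement):
    """Replace every reference to `key` inside `task` with `replacement`."""
    if is_task(task):
        return (task[0],) + tuple(substitute(a, key, replacement) for a in task[1:])
    if isinstance(task, str) and task == key:
        return replacement
    return task


def fuse_linear(graph, keep):
    """Inline a task into its consumer when the link between them is linear:
    the task has exactly one consumer and that consumer has exactly one input."""
    graph = dict(graph)
    changed = True
    while changed:
        changed = False
        dependents = defaultdict(set)
        for key, task in graph.items():
            for dep in find_deps(task, graph):
                dependents[dep].add(key)
        for child, users in dependents.items():
            if child in keep or len(users) != 1:
                continue
            (parent,) = users
            if len(find_deps(graph[parent], graph)) != 1:
                continue  # parent is a reduce-like step; leave it alone
            graph[parent] = substitute(graph[parent], child, graph[child])
            del graph[child]
            changed = True
            break  # the graph changed, so rebuild the dependents map
    return graph


def execute(graph, key):
    """Naive recursive evaluation (no caching), enough for this toy graph."""
    def run(x):
        if is_task(x):
            return x[0](*(run(a) for a in x[1:]))
        if isinstance(x, str) and x in graph:
            return run(graph[x])
        return x
    return run(key)


# Case A from the question: two chains of small tasks feeding a reduce step.
graph = {
    "a1": (inc, 1), "a2": (double, "a1"), "a3": (inc, "a2"),
    "b1": (inc, 10), "b2": (double, "b1"), "b3": (inc, "b2"),
    "total": (add, "a3", "b3"),
}
fused = fuse_linear(graph, keep={"total"})
print(sorted(fused))            # ['a3', 'b3', 'total']
print(execute(fused, "total"))  # 28
```

After fusion each chain is a single nested task, which corresponds to case B: nothing inside a chain has to cross a process boundary, only the two chain outputs that feed the reduce step.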

If your workloads are more complex and do require inter-node communication then you might want to try the distributed scheduler on a single computer. It manages data movement between workers more intelligently.

$ pip install dask distributed

>>> from dask.distributed import Client
>>> c = Client()  # Starts local "cluster".  Becomes the global scheduler

Correction

Also, just as a note, Dask doesn't persist intermediate results on disk. Rather it communicates intermediate results directly between processes.
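Because intermediate results travel between processes, each one has to survive a serialization round trip. A rough way to check an intermediate object ahead of time is sketched below with the standard `pickle` module. This assumes the scheduler serializes with pickle or a pickle-compatible library such as cloudpickle; `survives_roundtrip` is a hypothetical helper name, not part of dask.

```python
import pickle
import threading


def survives_roundtrip(obj):
    """Return (True, payload size in bytes) on success, (False, error text) on failure."""
    try:
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.loads(payload)
    except Exception as exc:  # pickling errors vary by object type
        return False, f"{type(exc).__name__}: {exc}"
    return True, len(payload)


# A plain list of integers serializes without trouble.
ok_list, size = survives_roundtrip(list(range(1000)))

# A lock is a typical example of an object that cannot be pickled.
ok_lock, error = survives_roundtrip(threading.Lock())

print(ok_list, size)
print(ok_lock, error)
```

The payload size also gives a rough idea of how much data would have to move between processes for that result.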

edited Dec 13 '16 at 02:07

answered Dec 13 '16 at 01:05


Thank you! That's interesting, because we have an unserializable object (lasio.LasFile; cloudpickle can't handle it for some reason, see https://github.com/kinverarity1/lasio/issues/143) that is exclusively an intermediate result and only lives along single paths in the computation graph. And dask fails right from the start with an exception when unpickling it... – Alexander Reshytko Dec 13 '16 at 14:12
If the object is in your dask graph, rather than being an intermediate output, then you should expect that error. When you say "right from the start", does that mean that the object is present in the computation you ask dask to compute? – MRocklin Dec 15 '16 at 14:03
No, its instances are produced by the first "line" of tasks in the graph. – Alexander Reshytko Dec 15 '16 at 16:59
