
I understand that dask works well in batch mode like this:

def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    with open(..., 'w') as f:
        f.write(result)

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel
  1. Can we use dask to process a streaming channel, where the number of chunks is unknown or even endless?
  2. Can it perform the computation in an incremental way? For example, could the 'analyze' step above process ongoing chunks?
  3. Must we call the "get" operation only after all the data chunks are known, or could we add new chunks after "get" was called?
sami
1 Answer


Edit: see newer answer below

No

The current task scheduler within dask expects a single computational graph. It does not support dynamically adding to or removing from this graph. The scheduler is designed to evaluate large graphs in a small amount of memory; knowing the entire graph ahead of time is critical for this.

However, this doesn't stop one from creating other schedulers with different properties. One simple solution here is just to use a module like concurrent.futures on a single machine, or the distributed library on multiple machines.
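
As a minimal sketch (not part of the original answer) of that idea: handle chunks with concurrent.futures as they arrive, reusing the question's load and clean functions. The generator new_chunks() is hypothetical and stands in for whatever yields filenames as they become available.

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(filename):
    # load and clean are the functions from the question
    return clean(load(filename))

results = []
with ProcessPoolExecutor() as pool:
    # submit work as chunks show up; nothing requires knowing them all up front
    futures = [pool.submit(process_chunk, fn) for fn in new_chunks()]
    for fut in as_completed(futures):
        # an incremental 'analyze' step could consume each result here
        results.append(fut.result())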

Actually Yes

The distributed scheduler now operates fully asynchronously: you can submit tasks, wait on a few of them, submit more, cancel tasks, add or remove workers, and so on, all during computation. There are several ways to do this, but the simplest is probably the new concurrent.futures-style interface described briefly here:

http://dask.pydata.org/en/latest/futures.html
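
A minimal sketch of that interface, assuming a local dask.distributed cluster and reusing the question's load and clean functions; new futures can be submitted at any point, even while earlier ones are still running.

from dask.distributed import Client, as_completed

client = Client()  # starts a local scheduler and workers by default

futures = []
for filename in ['myfile.a.data', 'myfile.b.data']:
    data = client.submit(load, filename)        # returns immediately with a future
    futures.append(client.submit(clean, data))  # futures can be passed as arguments

for fut in as_completed(futures):
    result = fut.result()
    # more chunks can be submitted here as they arrive; the graph is not fixed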

MRocklin