
I don't understand the relation between regular Dask and dask.distributed.

With dask.distributed, e.g. using the Futures interface, I have to explicitly create a client, which is backed by a local or remote cluster, and then submit to it using client.submit().

With regular Dask, e.g. using the Delayed interface, I just use delayed() on my functions.

How does delayed (or compute) determine where my computation takes place? There must be some global state behind it – but how would I access it? If I understand correctly, delayed uses a dask.distributed client if it exists. Does it use something like

from distributed import Client

client = None
try:
    client = Client.current()
except ValueError:
    pass
if client is not None:
    # use client
else:
    # use default scheduler

If so, why not use the same logic for submit?

client = None
try:
    client = Client.current()
except ValueError:
    pass
if client is not None:
    # use client
else:
    # fail because futures don't work on the default scheduler

And finally, delayed objects and future objects appear very similar. Why can the first use both a dask.distributed client and the default scheduler, while futures need dask.distributed?

A. Donda

3 Answers


Yes, there is some global state that assigns a current client:

https://github.com/dask/distributed/blob/f3f4bffea0640c01fc54f49c3219cf5807d14c66/distributed/client.py#L93

If you call the compute method on a delayed object, you'll end up using the current client.

Dask delayed is just syntactic sugar that builds up a computation graph. When you call compute, the graph is dispatched to the current scheduler: the distributed client if one exists, otherwise the default local scheduler.
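This fallback can be seen without creating any client at all; a minimal sketch (the scheduler keyword just makes the choice explicit):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

# No distributed Client exists, so compute() uses a local scheduler.
# The scheduler keyword makes the choice explicit ("threads" is the
# default for delayed).
result = inc(1).compute(scheduler="threads")
print(result)  # 2
```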

A future refers to a remote result on a cluster that may not be computed yet. A delayed object, by contrast, hasn't been submitted to the cluster.

from dask import delayed

@delayed
def func(x):
    return x

a = func(1)

In this case, a is a delayed object. That task hasn't been queued on the cluster at all.
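A side effect makes the laziness visible; a small sketch (the calls list is purely illustrative):

```python
from dask import delayed

calls = []  # records when func actually runs

@delayed
def func(x):
    calls.append(x)
    return x

a = func(1)
print(calls)  # [] -- nothing has executed yet

result = a.compute(scheduler="synchronous")
print(result, calls)  # 1 [1] -- execution happened only at compute()
```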

future = client.compute(a, sync=False)

You get a future after the task has been submitted to the cluster.
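Putting the two together, a minimal sketch assuming dask.distributed is installed (Client(processes=False) starts an in-process cluster purely for illustration):

```python
from dask import delayed
from distributed import Client

@delayed
def func(x):
    return x

a = func(1)                       # still just a graph, nothing submitted

client = Client(processes=False)  # in-process cluster, for illustration only
future = client.compute(a)        # submission happens here; returns a Future
print(future.result())            # 1 -- blocks until the task finishes
client.close()
```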

Hugo

Dask has multiple schedulers (backends). If you don't specify one, everything runs on your local machine, by default using a thread pool sized to the number of cores in your CPU. When defining a cluster (local, Kubernetes, HPC, Spark) you can specify exactly what you want. However, there is no difference in what the client sees, only in where and how the work is executed.
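The backend can also be chosen explicitly through Dask's configuration; a minimal sketch:

```python
import dask
from dask import delayed

@delayed
def double(x):
    return 2 * x

# "threads", "processes", or "synchronous" select a local backend;
# a distributed Client, once created, takes over automatically.
with dask.config.set(scheduler="synchronous"):
    print(double(21).compute())  # 42
```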

All futures are submitted to your backend as soon as you create them, and they run asynchronously. In the meantime you can do other work on the client; when you need the result, you fetch it with .result(), which blocks until the computation has finished. I haven't worked with the futures API as much, but it is designed to mirror Python's concurrent.futures as closely as possible, which is probably also why you have to start a client beforehand.
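The pattern Dask mirrors can be sketched with the standard library itself (concurrent.futures, no Dask required):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    time.sleep(0.1)
    return x * x

with ThreadPoolExecutor() as pool:
    fut = pool.submit(slow_square, 3)  # returns immediately
    # ... do other work here while slow_square runs in the background ...
    print(fut.result())                # 9 -- blocks until the task is done
```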

The delayed, dataframe, and array APIs only send computation to the backend once you call .compute(). That call blocks until the result is returned, so you can't do anything else in between.

JulianWgs

A future cannot be used on a local machine (without a local cluster), since it triggers computation right away, so any further calculations in the same code would be blocked. delayed allows you to postpone computation until the DAG is formed, which is why delayed can run on a single machine with or without a cluster.
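For example, several delayed calls can be composed into one DAG before anything runs; a minimal sketch:

```python
from dask import delayed

@delayed
def add(a, b):
    return a + b

# Build the DAG first; no computation happens yet
total = add(add(1, 2), add(3, 4))

print(total.compute())  # 10 -- the whole graph runs in one go
```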

SultanOrazbayev
  • This only addresses the last part of my question. Plus, why wouldn't it be possible to trigger computation right away with the default scheduler? – A. Donda Feb 06 '21 at 16:59