I have a simple graph structure that takes N independent tasks and then aggregates them. I do not care in what order the results of the independent tasks are aggregated. Is there a way that I can speed up computation by acting on the dependencies as they become available?
Consider the following example. In it, parallel tasks each wait some random time, then return. An additional task collects the results, forming an ordered queue. If collection occurs asynchronously, then the order will be based on when the tasks complete. If collection occurs synchronously, then the order will be statically defined by the input.
from multiprocessing import Pool
from dask import delayed
import numpy as np
from time import sleep
def wait(i):
    """Something embarrassingly parallel"""
    np.random.seed()
    t = np.random.uniform()
    sleep(t)
    print(i, t)
    return i, t
def lineup(who_when):
    """Aggregate"""
    order = []
    for who, when in who_when:
        print(f'who: {who}')
        order.append(who)
    return order
Using imap_unordered, collection/reduction begins as soon as the first task finishes, before all dependencies are complete.
n = 5
pool = Pool(processes=n)
lineup(pool.imap_unordered(wait, range(n)))
# Produces something like the following
2 0.2837069069881948
4 0.44156753704276597
who: 2
who: 4
1 0.5563172244950703
0 0.6696008076879393
who: 1
who: 0
3 0.9911326214345308
who: 3
[2, 4, 1, 0, 3]
Using dask.delayed in the way I'm accustomed to, the behavior is like map(): collection begins only once all dependencies are available, and the order is static.
n = 5
order = delayed(lineup)([delayed(wait)(i) for i in range(n)])
order.compute()
# produces something like:
0 0.2792789023871932
2 0.44570072028850705
4 0.6969597596416385
1 0.766705306208266
3 0.9889956337687371
who: 0
who: 1
who: 2
who: 3
who: 4
[0, 1, 2, 3, 4]
Is there an imap_unordered equivalent in dask? Perhaps something using dask.bag?
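For reference, the semantics I'm after are the same as the standard library's concurrent.futures.as_completed: consume results in completion order rather than submission order. A minimal stdlib sketch (not dask — just to pin down the desired behavior; the wait function here is a simplified version of the one above):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep
import random

def wait(i):
    """Something embarrassingly parallel: sleep a random time."""
    t = random.uniform(0, 0.2)
    sleep(t)
    return i, t

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(wait, i) for i in range(5)]
    # as_completed yields futures in the order they finish,
    # so aggregation starts as soon as the first task is done.
    order = [f.result()[0] for f in as_completed(futures)]

print(order)  # e.g. [2, 4, 1, 0, 3] — order varies run to run
```

I'd like the dask equivalent of this pattern, so that lineup can start consuming results while slower tasks are still running.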