
I have a trivially parallelizable task: computing results independently for many tables split across many files. I can construct lists of delayed values or dask.dataframe objects (and have also tried, e.g., a dict), but I cannot get all of the results to compute at once (I can retrieve individual results from a dask-graph-style dictionary using .get(), but again can't compute all results easily). Here's a minimal example:

>>> df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)
>>> numbers = [df['a'].mean() for _ in range(2)]
>>> dd.compute(numbers)
([<dask.dataframe.core.Scalar at 0x7f91d1523978>,
  <dask.dataframe.core.Scalar at 0x7f91d1523a58>],)

Similarly:

>>> from dask import delayed
>>> @delayed
... def mean(data):
...     return sum(data) / len(data)
>>> delayed_numbers = [mean([1,2]) for _ in range(2)]
>>> dask.compute(delayed_numbers)
([Delayed('mean-0e0a0dea-fa92-470d-b06e-b639fbaacae3'),
  Delayed('mean-89f2e361-03b6-4279-bef7-572ceac76324')],)

I would like to get [1.5, 1.5], which is what I would expect based on the delayed collections docs.

For my real problem I actually want to compute on tables in an HDF5 file, but given that I can get that to work with dask.get(), I'm fairly sure I'm already specifying my delayed / dask dataframe step correctly.

I would be interested in a solution that directly yields a dictionary, but I can also just pass a list of (key, value) tuples to dict(), which is probably not a huge performance hit.

Dav Clark

1 Answer


Compute takes many collections as separate arguments. Try splatting out your arguments as follows:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)

In [4]: numbers = [df['a'].mean() for _ in range(2)]

In [5]: dd.compute(*numbers)  # note the *
Out[5]: (1.5, 1.5)

Or, as might be more common:

In [6]: dd.compute(df.a.mean(), df.a.std())
Out[6]: (1.5, 0.707107)
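The same splatting trick covers the dictionary case mentioned in the question: splat the dict's values into compute, then zip the keys back onto the results. A minimal sketch (the task names here are made up for illustration):

```python
import dask
from dask import delayed

@delayed
def mean(data):
    return sum(data) / len(data)

# A dict of named delayed tasks (hypothetical keys).
tasks = {'first': mean([1, 2]), 'second': mean([3, 4, 5])}

# dask.compute(*values) returns a tuple of results in the same
# order as the values, so zipping the keys back on rebuilds a dict.
results = dict(zip(tasks, dask.compute(*tasks.values())))
# results == {'first': 1.5, 'second': 4.0}
```

This computes the whole dict in one graph traversal, sharing any common intermediate results, rather than calling .get() once per key.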
MRocklin
  • OK, this is "working", except that for my full-blown example it's quite slow (IO and CPU are heavily underutilized, I only see one thread, and dask.multiprocessing.get throws some exceptions). But that's an issue for a separate question. – Dav Clark May 24 '16 at 02:00
  • 1
    multiprocessing will have to move data between processes. I recommend using threaded.get (which should be the default) or the distributed scheduler on a single machine (which avoids data movement better): http://distributed.readthedocs.org – MRocklin May 24 '16 at 15:27
  • is there a way to call visualize() on the list of splatted arguments, or can that only be done on each item in the list? – user4446237 Dec 28 '20 at 16:52