
I have a trivially parallelizable task: computing results independently for many tables split across many files. I can construct lists of delayed values or dask.dataframe objects (and have also tried, e.g., a dict), but I cannot get all of the results to compute at once (I can retrieve individual results from a dask-graph-style dictionary using .get(), but again can't compute all results easily). Here's a minimal example:

>>> df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)
>>> numbers = [df['a'].mean() for _ in range(2)]
>>> dd.compute(numbers)
([<dask.dataframe.core.Scalar at 0x7f91d1523978>,
  <dask.dataframe.core.Scalar at 0x7f91d1523a58>],)

Similarly:

>>> from dask import delayed
>>> @delayed
... def mean(data):
...     return sum(data) / len(data)
>>> delayed_numbers = [mean([1,2]) for _ in range(2)]
>>> dask.compute(delayed_numbers)
([Delayed('mean-0e0a0dea-fa92-470d-b06e-b639fbaacae3'),
  Delayed('mean-89f2e361-03b6-4279-bef7-572ceac76324')],)

I would like to get [1.5, 1.5], which is what I would expect based on the delayed collections docs.

For my real problem I actually want to compute on tables in an HDF5 file, but given that I can get that to work with dask.get(), I'm fairly sure I'm already specifying my delayed / dask dataframe step correctly.

I would be interested in a solution that directly yields a dictionary, but I can also just pass a list of (key, value) tuples to dict(), which is probably not a huge performance hit.

Dav Clark

1 Answer


Compute takes many collections as separate arguments. Try splatting out your arguments as follows:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = dd.from_pandas(pd.DataFrame({'a': [1,2]}), npartitions=1)

In [4]: numbers = [df['a'].mean() for _ in range(2)]

In [5]: dd.compute(*numbers)  # note the *
Out[5]: (1.5, 1.5)

Or, as might be more common:

In [6]: dd.compute(df.a.mean(), df.a.std())
Out[6]: (1.5, 0.707107)
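The same splatting trick covers the dictionary case mentioned in the question: splat the dict's values into compute, then zip the keys back onto the results. A minimal sketch (the task names here are made up for illustration):

```python
import dask
from dask import delayed

@delayed
def mean(data):
    return sum(data) / len(data)

# A dict of named delayed tasks (hypothetical keys).
tasks = {'first': mean([1, 2]), 'second': mean([3, 4, 5])}

# dask.compute(*values) returns a tuple of results in the same
# order as the values, so zipping the keys back on rebuilds a dict.
results = dict(zip(tasks, dask.compute(*tasks.values())))
# results == {'first': 1.5, 'second': 4.0}
```

This computes the whole dict in one graph traversal, sharing any common intermediate results, rather than calling .get() once per key.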
MRocklin
  • OK, this is "working", except that for my full-blown example it's quite slow (IO and CPU are heavily underutilized, I only see one thread, and dask.multiprocessing.get throws some exceptions). But that's an issue for a separate question. – Dav Clark May 24 '16 at 02:00
  • 1
    multiprocessing will have to move data between processes. I recommend using threaded.get (which should be the default) or the distributed scheduler on a single machine (which avoids data movement better): http://distributed.readthedocs.org – MRocklin May 24 '16 at 15:27
  • is there a way to call visualize() on the list of splatted arguments, or can that only be done on each item in the list? – user4446237 Dec 28 '20 at 16:52