I've created a Dask Dataframe (called "df") and the column with index "11" has integer values:

In [62]: df[11]
Out[62]:
Dask Series Structure:
npartitions=42
    int64
      ...
    ...
      ...
      ...
Name: 11, dtype: int64
Dask Name: getitem, 168 tasks

I'm trying to sum these with:

df[11].sum() 

This returns dd.Scalar<series-..., dtype=int64> rather than a number. Despite researching what this might mean, I'm still at a loss as to why I'm not getting a numerical value. How can I convert this into its numerical value?

jbentley
  • Doesn't `df[11].sum().compute()` work? – jezrael Oct 05 '18 at 10:40
  • Works great! I can't find .compute() in the documentation for .sum(), or anywhere else in the Dask documentation, so I must have missed something. I'm not sure why this works. Would you mind pointing me in the right direction? – jbentley Oct 05 '18 at 10:42

1 Answer

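I think you need `compute`, which tells Dask to actually execute everything that came before it. Dask operations such as `sum` are lazy: they only describe the work to be done, which is why you get a `dd.Scalar` placeholder back instead of a number. From the documentation: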

compute(**kwargs)
Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a numpy.array() and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

df[11].sum().compute()
jezrael