
In my application I perform an aggregation on a dask dataframe using groupby, keyed on the id the frame is indexed and sorted by.

However, I would like the aggregation to maintain the partition divisions, as I intend to perform joins with other dataframes that are identically partitioned.

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'

ddf = dd.from_pandas(df, npartitions=4)
ddf.npartitions
# 4

ddf.divisions
# (0, 4, 8, 12, 15)

aggregated = ddf.groupby('my_id').agg({'my_data': 'count'})
aggregated.divisions
# (None, None)

Is there a way to accomplish that?

pygabriel

1 Answer


You probably can't maintain the same partitioning, because dask will need to aggregate counts between partitions. Your data will necessarily have to move around in ways that depend on the values of your data.

If you're looking to ensure that your output has many partitions, you might choose to use the `split_out=` keyword to `agg`.
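
For example, a minimal sketch of `split_out=` on the question's dataframe. Note that `split_out` only controls the number of output partitions; the result is partitioned by hash of the group key, so the divisions remain unknown and this does not by itself give you the aligned partitioning asked about:

aggregated = ddf.groupby('my_id').agg({'my_data': 'count'}, split_out=4)

aggregated.npartitions
# 4

aggregated.divisions
# (None, None, None, None, None)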

MRocklin
  • But doesn't the index ensure that, for instance, ids from 0 to 4 are on the same partition? Then if I take the group with `my_id=0`, it is guaranteed that all the elements will be on the same partition and nothing needs to be moved (in fact, I can do apply there to get the result identically partitioned). I was wondering if it is possible to do the same with the agg method. – pygabriel Feb 17 '18 at 00:51
  • Ah I see, I failed to see that your dataframe was indexed by `my_id` – MRocklin Feb 17 '18 at 21:23
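
Following up on the comment thread: because the groupby key here is the index and the divisions are known, every group lives entirely inside one partition, so the aggregation can be applied partition-by-partition with no data movement. A minimal sketch using map_partitions (rather than the apply the comment mentions); this only works when the key is the sorted index with known divisions, since with any other key the groups could span partitions and a per-partition aggregation would be wrong:

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'
ddf = dd.from_pandas(df, npartitions=4)

# Each my_id value falls entirely inside one partition, so a plain
# pandas groupby per partition equals the global aggregation, and
# map_partitions keeps the original divisions.
aggregated = ddf.map_partitions(
    lambda part: part.groupby('my_id').agg({'my_data': 'count'})
)

aggregated.divisions
# (0, 4, 8, 12, 15)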