
In my application I perform an aggregation on a dask dataframe using groupby, keyed on the id the frame is indexed and sorted by.

However, I would like the aggregation to maintain the partition divisions, as I intend to perform joins with other dataframes that are identically partitioned.

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'

ddf = dd.from_pandas(df, npartitions=4)
ddf.npartitions
# 4

ddf.divisions
# (0, 4, 8, 12, 15)

aggregated = ddf.groupby('my_id').agg({'my_data': 'count'})
aggregated.divisions
# (None, None)

Is there a way to accomplish that?

pygabriel

1 Answer


You probably can't maintain the same partitioning, because dask will need to aggregate counts between partitions. Your data will necessarily have to move around in ways that depend on the values of your data.

If you're looking to ensure that your output has many partitions, you might choose to use the `split_out=` keyword to `agg`.
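
For example, a minimal sketch of `split_out=` on the question's dataframe. Note that `split_out` only controls the number of output partitions; the result is partitioned by hash of the group key, so the divisions remain unknown and this does not by itself give you the aligned partitioning asked about:

aggregated = ddf.groupby('my_id').agg({'my_data': 'count'}, split_out=4)

aggregated.npartitions
# 4

aggregated.divisions
# (None, None, None, None, None)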

MRocklin
  • But doesn't the index ensure that, for instance, ids from 0 to 4 are on the same partition? Then if I take the group with `my_id=0`, it is guaranteed that all the elements will be on the same partition and nothing needs to be moved (in fact, I can do apply there to get the result identically partitioned). I was wondering if it is possible to do the same with the agg method. – pygabriel Feb 17 '18 at 00:51
  • Ah I see, I failed to see that your dataframe was indexed by `my_id` – MRocklin Feb 17 '18 at 21:23
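
Following up on the comment thread: because the groupby key here is the index and the divisions are known, every group lives entirely inside one partition, so the aggregation can be applied partition-by-partition with no data movement. A minimal sketch using map_partitions (rather than the apply the comment mentions); this only works when the key is the sorted index with known divisions, since with any other key the groups could span partitions and a per-partition aggregation would be wrong:

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame(np.arange(16), columns=['my_data'])
df.index.name = 'my_id'
ddf = dd.from_pandas(df, npartitions=4)

# Each my_id value falls entirely inside one partition, so a plain
# pandas groupby per partition equals the global aggregation, and
# map_partitions keeps the original divisions.
aggregated = ddf.map_partitions(
    lambda part: part.groupby('my_id').agg({'my_data': 'count'})
)

aggregated.divisions
# (0, 4, 8, 12, 15)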