
My Dask DataFrame looks like this:

A     B     C     D
1     foo   xx    this
1     foo   xx    belongs
1     foo   xx    together
4     bar   xx    blubb

I want to group by columns A, B, and C and join the strings from D with a space between them, to get:

A     B     C     D
1     foo   xx    this belongs together
4     bar   xx    blubb

I see how to do this with pandas:

df_grouped = df.groupby(['A','B','C'])['D'].agg(' '.join).reset_index()

How can this be achieved with Dask?

bucky

2 Answers

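# Each group's D values arrive as a pandas Series; meta declares the output name and dtype
ddf = ddf.groupby(['A','B','C'])['D'].apply(lambda s: ' '.join(s), meta=('D', 'object')).reset_index()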
ddf.compute()

Output:

Out[75]: 
   A    B   C                      D
0  1  foo  xx  this belongs together
0  4  bar  xx                  blubb
KRKirov

You could use a custom aggregation (dask.dataframe.Aggregation), where both the per-chunk and the aggregation operations are your ' '.join method.

https://docs.dask.org/en/latest/dataframe-api.html#custom-aggregation

MRocklin