I want to group by a single column, then use agg to take the mean of a couple of columns while selecting just the first or last value for the remaining columns. This is possible in pandas, but isn't currently supported in Dask. How can I do this? Thanks.

aggs = {'B': 'mean', 'C': 'mean', 'D': 'first', 'E': 'first'}
ddf.groupby(by='A').agg(aggs)
morganics
  • I would [raise an issue](https://github.com/dask/dask/issues/new) for a feature request. – MRocklin Feb 24 '18 at 14:58
  • Thanks @MRocklin, issue is here: https://github.com/dask/dask/issues/3206 – morganics Feb 25 '18 at 09:50
  • This has been implemented [here](https://github.com/dask/dask/pull/3389) in April 2018. So your code should actually work out of the box now. – gies0r Aug 09 '20 at 14:37

1 Answer

You can use dask.dataframe.DataFrame.drop_duplicates to keep one row per group, and then join that result to the aggregated DataFrame:

import pandas as pd

# keys listed in the same order as the columns in the printed output
df = pd.DataFrame({'A':list('aaabbb'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('abcdef')})

print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  a  5  8  3  3  b
2  a  4  9  5  6  c
3  b  5  4  7  9  d
4  b  5  2  1  2  e
5  b  4  3  0  4  f

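from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)

c = ['B','C']
# mean of the selected columns per group
a = ddf.groupby(by='A')[c].mean()
# one row per value of A for the remaining columns
b = ddf.drop(c, axis=1).drop_duplicates(subset=['A'])
# attach the group means to those rows via the group key
df = b.join(a, on='A').compute()
print (df)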
   A  D  E  F         B    C
0  a  1  5  a  4.333333  8.0
3  b  7  9  d  4.666667  3.0
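
As a cross-check, the same numbers can be reproduced in plain pandas, which the question notes already supports mixed aggregations. This is only a sketch on the sample data above; the names `pdf` and `expected` are hypothetical, and the result is indexed by A rather than keeping A as a column:

```python
import pandas as pd

# Same sample data as in the answer.
pdf = pd.DataFrame({
    'A': list('aaabbb'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('abcdef'),
})

# Mean for B and C, first value per group for everything else.
aggs = {'B': 'mean', 'C': 'mean', 'D': 'first', 'E': 'first', 'F': 'first'}
expected = pdf.groupby('A').agg(aggs)
print(expected)
```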
jezrael