iterate over GroupBy object in dask

Question

Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried:

import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({'A':[1,2,3,4,5], 'B':['1','1','a','a','a']})
ddf = dd.from_pandas(pdf, npartitions = 3)
groups = ddf.groupby('B')
for name, df in groups:
    print(name)

However, this results in an error: KeyError: 'Column not found: 0'

More generally, what kinds of interaction does the dask GroupBy object allow, apart from the apply method?

@StevenG thanks for this feedback. Maybe there is an issue with my setup — Arco Bast, Sep 27 '16 at 18:45
In your code you are iterating through pdf and not ddf. Are you trying to iterate through ddf or pdf? — Steven G, Sep 27 '16 at 18:51
I want to iterate through ddf ... thanks for pointing that out. I have edited my question. Can you iterate through the dask dataframe? — Arco Bast, Sep 27 '16 at 19:17

score 8 · Accepted Answer · answered Sep 27 '16 at 19:26

You can iterate over the groups in dask by looping over the unique keys and calling get_group for each one. There may be a better way, but this works for me.

import dask.dataframe as dd
import pandas as pd
pdf = pd.DataFrame({'A':[1, 2, 3, 4, 5], 'B':['1','1','a','a','a']})
ddf = dd.from_pandas(pdf, npartitions = 3)
groups = ddf.groupby('B')

# take the group keys from the pandas frame, then look each group up lazily
for group in pdf['B'].unique():
    print(groups.get_group(group))

This prints one lazy dask dataframe per group (call .compute() on each to get the underlying pandas dataframe):

dd.DataFrame<dataframe-groupby-get_group-e3ebb5d5a6a8001da9bb7653fface4c1, divisions=(0, 2, 4, 4)>
dd.DataFrame<dataframe-groupby-get_group-022502413b236592cf7d54b2dccf10a9, divisions=(0, 2, 4, 4)>

score 5 · Answer 2 · answered Oct 03 '16 at 12:41

Generally, iterating over dask.dataframe objects is not recommended because it is inefficient. Instead, you might want to write a function and map it over the groups using groupby.apply.

MRocklin


Groupby.apply didn't work because of https://github.com/dask/dask/issues/1587, so I was looking for a workaround – Arco Bast Oct 06 '16 at 14:57
It was fixed in https://github.com/dask/dask/pull/1625 – franchb Feb 07 '19 at 13:21
Not always applicable, because of "apply func once to each partition-group pair"; see: https://stackoverflow.com/questions/60711871/dask-apply-with-custom-function – fjsj Apr 09 '20 at 02:07
