
How can I get all unique groups from a Dask grouped DataFrame? Let's say we have the following code:

g = df.groupby(['Year', 'Month', 'Day'])

I have to iterate through all groups and process the data within each group. My idea was to get all unique value combinations and then iterate through that collection, calling e.g.

g.get_group((2018, 1, 12)).compute()

for each of them... which is not going to be fast, but hopefully will work.

In Spark/Scala I can achieve something like this using the following approach:

val res = myDataFrame.groupByKey(x => groupFunctionWithX(x)).mapGroups((key, iter) => {
  // process the group with all of its child records
})

I am wondering: what is the best way to implement something like this using Dask/Python?

Any assistance would be greatly appreciated!

Best, Michael

UPDATE

I have tried the following in python with pandas:

df = pd.read_parquet(path, engine='pyarrow')
g = df.groupby(['Year', 'Month', 'Day'])
g.apply(lambda x: print(x.Year.iloc[0], x.Month.iloc[0], x.Day.iloc[0], x.count().iloc[0]))

And this worked perfectly fine. Afterwards, I tried the same with Dask:

df2 = dd.read_parquet(path, engine='pyarrow')
g2 = df2.groupby(['Year', 'Month', 'Day'])
g2.apply(lambda x: print(x.Year.iloc[0], x.Month.iloc[0], x.Day.iloc[0], x.count().iloc[0]))

This has led me to the following error:

ValueError: Metadata inference failed in `groupby.apply(lambda)`.

Any ideas what went wrong?

qwertz1123

1 Answer


Computing one group at a time is likely to be slow. Instead, I recommend using groupby-apply:

df.groupby([...]).apply(func)

As in Pandas, the user-defined function func should expect a Pandas DataFrame containing all rows corresponding to one group, and should return either a Pandas DataFrame, a Pandas Series, or a scalar.

Getting one group at a time can be cheap if your data is indexed by the grouping column:

df = df.set_index('date')
part = df.loc['2018-05-01'].compute()

Given that you're grouping by a few columns, though, I'm not sure how well this will work.

MRocklin
  • The question is, what can I do inside the apply function? Can I save just the values of grouping columns? Or do I have access to all underlying rows within the group? – qwertz1123 Feb 19 '18 at 16:38
  • I have also tried doing something like g.apply(lambda x: (x.Year, x.Month, x.Day)), but it's not really working. – qwertz1123 Feb 19 '18 at 16:44
  • I've edited the answer above with more information. This works just like pandas groupby-apply. – MRocklin Feb 19 '18 at 16:56
  • Thank you for your example. I have tried it in pandas and dask. It works in pandas perfectly fine, but is not working in dask at all. – qwertz1123 Feb 20 '18 at 09:10
  • @MRocklin. I found several questions and answers related to dask, and tried to implement a method on this particular problem, but I cannot seem to figure it out. Can you please look into this https://stackoverflow.com/questions/50178441/merge-multiple-dataframe-using-dask-and-then-write-files-with-groupby – everestial007 May 04 '18 at 18:56
  • Note groupby + apply does not have the same behavior in Dask vs Pandas: https://stackoverflow.com/a/60725384/145349 – fjsj Apr 10 '20 at 15:28