How can I get all unique groups in Dask from a grouped data frame? Say we have the following code:
g = df.groupby(['Year', 'Month', 'Day'])
I have to iterate through all groups and process the data within each group. My idea was to get all unique key combinations, then iterate through that collection and call e.g.
g.get_group((2018, 1, 12)).compute()
for each of them. That is not going to be fast, but hopefully it will work.
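To make the idea concrete, here is a minimal sketch of what I have in mind (not benchmarked; the column names are from my data):

import dask.dataframe as dd

# Compute the unique key combinations first, then fetch each group.
keys = df[['Year', 'Month', 'Day']].drop_duplicates().compute()

g = df.groupby(['Year', 'Month', 'Day'])
for key in keys.itertuples(index=False, name=None):
    group = g.get_group(key).compute()  # one round trip per group -- slow
    # ... process the pandas DataFrame 'group' here ...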
In Spark/Scala I can achieve something like this with the following approach:
val res = myDataFrame.groupByKey(x => groupFunctionWithX(x)).mapGroups((key, iter) => {
  // process the group with all of its child records
})
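If I understand the Dask docs correctly, the closest analogue might be groupby().apply() with an explicit meta describing the output. A minimal sketch, assuming the processing returns a DataFrame with the same schema as the input (process_group is a placeholder name of mine):

import pandas as pd
import dask.dataframe as dd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one group as a plain pandas DataFrame,
    # roughly analogous to the (key, iter) body in mapGroups.
    return pdf  # placeholder processing

res = df.groupby(['Year', 'Month', 'Day']).apply(
    process_group,
    meta=df,  # Dask takes the output schema from here; an empty pandas
              # DataFrame with matching columns and dtypes also works
)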
I am wondering: what is the best way to implement something like this using Dask/Python?
Any assistance would be greatly appreciated!
Best, Michael
UPDATE
I have tried the following in Python with pandas:
import pandas as pd

df = pd.read_parquet(path, engine='pyarrow')
g = df.groupby(['Year', 'Month', 'Day'])
g.apply(lambda x: print(x.Year.iloc[0], x.Month.iloc[0], x.Day.iloc[0], x.count().iloc[0]))
And this worked perfectly fine. Afterwards, I tried the same with Dask:
import dask.dataframe as dd

df2 = dd.read_parquet(path, engine='pyarrow')
g2 = df2.groupby(['Year', 'Month', 'Day'])
g2.apply(lambda x: print(x.Year.iloc[0], x.Month.iloc[0], x.Day.iloc[0], x.count().iloc[0]))
This led to the following error:
ValueError: Metadata inference failed in `groupby.apply(lambda)`.
Any ideas what went wrong?
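From the error text, my guess is that Dask cannot infer the output metadata because the lambda only calls print and returns None. A sketch of what I suspect is needed: return an actual value per group and declare its type via meta (the 'count' name and the i8 dtype are my assumptions):

import dask.dataframe as dd

df2 = dd.read_parquet(path, engine='pyarrow')
g2 = df2.groupby(['Year', 'Month', 'Day'])

# Return a value per group instead of printing, and tell Dask the output type.
counts = g2.apply(len, meta=('count', 'i8'))
print(counts.compute())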