Ask
I would like to speed up a pandas groupby that sums two columns per group and returns the resulting dataframe.
Code
df = (df.groupby(['key', 'code', 'name', 'period', 'agg_metric'],
                 sort=False, observed=True, dropna=False)
        [['metricA', 'metricB']]
        .sum()
        .reset_index())
(The method currently takes 2 minutes to process the data for my largest use case.)
Data
The largest dataframe has about 1.5 million rows, on which the groupby is applied. period and agg_metric can be inferred from each other: there are only 2 period values (and thus 2 agg_metric values). name can likewise be inferred from code.
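For context, here's a rough synthetic setup that mimics that shape; the cardinalities, the name_<code> naming, and the period/agg_metric labels are made-up placeholders, not my real data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_500_000
code = rng.integers(0, 100_000, n)      # cardinality is a guess, chosen to land near 700k groups
period = rng.choice(['P1', 'P2'], n)    # only 2 period values

df = pd.DataFrame({
    'key': rng.integers(0, 10, n),
    'code': code,
    'name': np.char.add('name_', code.astype(str)),      # name inferred from code
    'period': period,
    'agg_metric': np.where(period == 'P1', 'M1', 'M2'),  # inferred from period
    'metricA': rng.random(n),
    'metricB': rng.random(n),
})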
After the groupby, I'm left with about 700k records. If I understand correctly, the slowdown is due to the number of resulting groups that have to be processed. Is there a way to vectorize this operation and apply the summation to all groups at once, rather than iterating over each group as I assume is happening now?
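To illustrate what I mean by "all groups at once", here's a rough sketch using pd.factorize over the composite key and np.bincount for the sums; I don't know whether building the composite key this way would actually beat groupby:

import numpy as np
import pandas as pd

keys = ['key', 'code', 'name', 'period', 'agg_metric']

# Label each row with an integer group id, then sum by label in one
# vectorized pass per metric column.
group_ids, uniques = pd.MultiIndex.from_frame(df[keys]).factorize()

out = uniques.to_frame(index=False)
out['metricA'] = np.bincount(group_ids, weights=df['metricA'].to_numpy())
out['metricB'] = np.bincount(group_ids, weights=df['metricB'].to_numpy())

# Caveat: factorize marks rows with NaN in any key column as -1, which
# np.bincount can't handle, so this sketch doesn't yet match dropna=False.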
Notes
I have tried using groupby().agg({...}) and groupby().apply(lambda), and both take about the same amount of time. I've also tried removing some of the groupby columns and adding them back afterwards, but it did not speed up the calculation, so it didn't warrant taking them out of the groupby (see the sketch below). The snippet also has sort=False and observed=True, but both did little to improve processing time.
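For completeness, these are roughly the variants I tried; the lookup Series below are rebuilt from the dataframe itself, since name and agg_metric are functionally dependent on code and period:

cols = ['key', 'code', 'name', 'period', 'agg_metric']

# groupby().agg({...}) variant -- same timing as .sum()
out = (df.groupby(cols, sort=False, observed=True, dropna=False)
         .agg({'metricA': 'sum', 'metricB': 'sum'})
         .reset_index())

# groupby().apply(lambda) variant -- also about the same
out = (df.groupby(cols, sort=False, observed=True, dropna=False)
         .apply(lambda g: g[['metricA', 'metricB']].sum())
         .reset_index())

# dropping the inferable columns and re-adding them afterwards
code_to_name = df.drop_duplicates('code').set_index('code')['name']
period_to_agg = df.drop_duplicates('period').set_index('period')['agg_metric']

slim = (df.groupby(['key', 'code', 'period'], sort=False, dropna=False)
          [['metricA', 'metricB']].sum().reset_index())
slim['name'] = slim['code'].map(code_to_name)
slim['agg_metric'] = slim['period'].map(period_to_agg)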
I've thoroughly gone through as many resources as I can (especially this great reference: General Groupby in Python Pandas: Fast way). I'm fairly new to vectorization, and am doing this as I'm off-loading several queries from our DB.