
Ask

I would like to speed up a pandas groupby that also applies a summation over two columns and returns the resulting dataframe.

Code

df = df.groupby(['key','code','name','period','agg_metric'], sort=False, observed=True, dropna=False)[['metricA','metricB']]\
.sum().reset_index()

(The method currently takes 2 minutes to process the data for my largest use case.)

Data

Overall, the largest dataframe has about 1.5 million rows on which the groupby is applied. period and agg_metric can be inferred from each other; there are only 2 period values (and thus 2 agg_metric values). The name value can also be inferred from code.

After the groupby, I'm left with 700k records. If I understand correctly, the slowdown is due to the number of resulting groups that have to be processed. Is there a way to vectorize this method and apply the summation to all groups at once, rather than iterating over them as I assume is currently happening?

Notes

I have tried using groupby().agg({...}) and groupby().apply(lambda), and both take about the same amount of time. I've also tried removing some of the groupby columns and adding them back afterwards, but it did not speed up the calculation, so it didn't warrant taking them out of the groupby. The snippet already uses sort=False and observed=True, but both did little to improve processing time.
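
For reference, the agg variant looks roughly like this (equivalent to the .sum() snippet above, and it performed about the same):

df = df.groupby(['key', 'code', 'name', 'period', 'agg_metric'],
                sort=False, observed=True, dropna=False)\
       .agg({'metricA': 'sum', 'metricB': 'sum'})\
       .reset_index()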

I've thoroughly gone through as many resources as I can (especially this great reference: General Groupby in Python Pandas: Fast way). I'm fairly new to vectorization and am doing this as I'm off-loading several queries from our DB.

Brent
  • Have you compared the performance with pd.pivot_table()? – Yashar Ahmadov Nov 07 '21 at 10:58
  • I just ran two tests, and pd.pivot_table() performed nearly the same at 122 seconds. Thank you for the tip. I haven't explored pivot_table fx yet. – Brent Nov 07 '21 at 11:22
  • In general, Python is not the best option if you are focusing on the speed. Some time ago I also had to investigate this topic and collected my findings in this blog post: https://yasharahmadov.com/2020/04/16/r-vs-python-for-big-data/ – Yashar Ahmadov Nov 07 '21 at 11:25
  • 1
    Great read, thanks! I have the script ready to run in parallel with concurrent.futures, so if one task takes ~2 minutes, it's not the end of the world but I'm just trying to ensure I get everything as optimized as possible using the worst case scenario. – Brent Nov 07 '21 at 11:39
  • 1
    Provide simple reproducible example, some random data of the same type and size, it will be easier for us to help you then – dankal444 Nov 07 '21 at 15:37
  • Check if `reset_index()` causes some overhead. You could use `as_index=False` in `groupby`. You could even normalize the data like relational databases and have the groupings be only integer IDs, with their lookup values in separate data frames to be merged later. – Parfait Nov 07 '21 at 17:06

1 Answer


What type of data do you have? It looks like the metricA / metricB columns are of type object, so pandas falls back to slow summation over Python objects instead of fast vectorized summation over numpy arrays. Try converting the metric columns to float64 or an integer type.

You can inspect the data types with the df.info() method.
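
For example (assuming the metric values can be parsed as numbers), the conversion could look roughly like this:

# inspect the current dtypes
df.info()

# convert the metric columns to a numeric dtype before grouping
df['metricA'] = pd.to_numeric(df['metricA'])
df['metricB'] = pd.to_numeric(df['metricB'])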

Proof:

from string import ascii_letters
from time import time

import numpy as np
import pandas as pd
from numpy.random import choice

# build a 1.5M-row frame of random keys, comparable in size to the question's data
N = 1_500_000
np.random.seed(123)
letters = list(ascii_letters)
words = ["".join(choice(letters, 5)) for i in range(30)]

df = pd.DataFrame(
    {
        "key": choice(words, N),
        "code": choice(words, N),
        "name": choice(words, N),
        "period": np.random.randint(0, 10, N),
        "agg_metric": choice(["mean", "sum", "count"], N),
        "metricA": np.random.rand(N),
        "metricB": np.random.rand(N),
    }
)

def aggregate(df):
    return (
        df.groupby(
            ["key", "code", "name", "period", "agg_metric"],
            sort=False,
            observed=True,
            dropna=False,
        )[["metricA", "metricB"]]
        .sum()
        .reset_index()
    )

start = time()
df2 = aggregate(df)
print(f"sum floats took {time() - start}")

start = time()
# same aggregation, but with the metric columns cast to Python objects
df3 = aggregate(df.astype({"metricA": object, "metricB": object}))
print(f"sum objects took {time() - start}")

assert df2.equals(df3)

Output:

sum floats took 0.2983248233795166
sum objects took 81.04267287254333
Alexander Volkovsky
  • 3 seconds flat, I'm absolutely floored! Thank you! I was using Decimal throughout for precision, which is type object as you highlighted. Before any calculations, the metrics are integers, so I converted them using pd.to_numeric(). Your insight gives significantly better performance than the 75% improvement I was able to achieve using numpy vectorization techniques from this article, though it may be useful for others later on. [Optimizing Pandas Groupby...](https://medium.com/@aivinsolatorio/optimizing-pandas-groupby-50x-using-numpy-to-speedup-an-agent-based-model-a-story-of-8b0d25614915) – Brent Nov 07 '21 at 22:43