
I have a pandas DataFrame with years as the index, one column with a stock ID, and a second column with returns. The DataFrame has ~200k rows. I want to add 3 additional columns with the cumulative returns of each stock over the next 5, 10 and 20 years respectively. To this end, I am grouping by the ID column and applying a function to the grouped object, as shown in a simple example below. I knew this was going to take some time, but the code has now been running for 23 hours and still hasn't finished.

I have 2 questions then:

  1. Why exactly is Python taking so long to execute the code? Where is the bottleneck?
  2. Any ideas on how I can change the code to make it faster?

Here is my code, applied to a simpler example.

In [1]: import pandas as pd

In [2]: simple_df = pd.DataFrame([[1,1,1,2,2],[0.1,0.05,0.15,0.3,0.2]], columns=[2010,2011,2012,2011,2012], index=['ID','Return']).T

In [3]: simple_df
Out[3]: 
       ID  Return
2010  1.0    0.10
2011  1.0    0.05
2012  1.0    0.15
2011  2.0    0.30
2012  2.0    0.20

In [4]: grouped = simple_df.groupby('ID', sort=False)

In [5]: create_df = lambda x: pd.DataFrame({i: x.Return.shift(-i) for i in range(0,3)})

In [6]: df_1 = grouped.apply(create_df)

In [7]: df_1
Out[7]: 
         0     1     2
2010  0.10  0.05  0.15
2011  0.05  0.15   NaN
2012  0.15   NaN   NaN
2011  0.30  0.20   NaN
2012  0.20   NaN   NaN

In [8]: df_2 =(df_1+1).cumprod(axis=1)-1

In [9]: df_2
Out[9]: 
         0       1        2
2010  0.10  0.1550  0.32825
2011  0.05  0.2075      NaN
2012  0.15     NaN      NaN
2011  0.30  0.5600      NaN
2012  0.20     NaN      NaN

In [10]: simple_df['Return_3y'] = df_2.iloc[:,2]

In [11]: simple_df
Out[11]: 
       ID  Return  Return_3y
2010  1.0    0.10    0.32825
2011  1.0    0.05        NaN
2012  1.0    0.15        NaN
2011  2.0    0.30        NaN
2012  2.0    0.20        NaN
danie
  • Where exactly is your code hanging? – Susensio May 08 '18 at 09:13
  • The problem is in `grouped.apply(create_df)`. Not in this simple example, of course, but with the real DataFrame (which has ~200k rows and ~10k different groups) and with the create_df function changed to "...for i in range(0, 20)", since I need the cumulative return up to 20 years in the future. – danie May 08 '18 at 09:24

1 Answer


Instead of `apply`, use `DataFrameGroupBy.shift` with `concat`:

import numpy as np
import pandas as pd

np.random.seed(234)
N = 10000

idx = np.random.randint(1990, 2020, size=N)
simple_df = pd.DataFrame({'ID': np.random.randint(1000, size=N),
                          'Return': np.random.rand(N)}, index=idx).sort_values('ID')

print(simple_df)

In [147]: %%timeit
     ...: grouped = simple_df.groupby('ID', sort=False)
     ...: create_df = lambda x: pd.DataFrame({i: x.Return.shift(-i) for i in range(0,3)})
     ...: df_1 = grouped.apply(create_df)
     ...: df_2 =(df_1+1).cumprod(axis=1)-1
     ...: 
1.01 s ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [148]: %%timeit
     ...: g = simple_df.groupby('ID', sort=False)
     ...: df2 = pd.concat([g['Return'].shift(-i) for i in range(3)], axis=1, keys=range(3))
     ...: df2 =(df2+1).cumprod(axis=1)-1
     ...: 
3.91 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
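
A minimal sketch of how the faster approach could be written back into the question's frame. The column names `Return_5y`/`Return_10y`/`Return_20y` and the 20-year horizon are taken from the question, not from the answer above, and the assignment relies on the concat output keeping the original row order, which the groupby shift preserves:

import pandas as pd

# Sketch: extend the groupby-shift + concat idea to the 5/10/20-year horizons.
g = simple_df.groupby('ID', sort=False)

# One shifted Series per future year; a 20-year horizon needs shifts 0..19.
shifted = pd.concat([g['Return'].shift(-i) for i in range(20)],
                    axis=1, keys=range(20))

# Compound the returns along each row.
cum = (shifted + 1).cumprod(axis=1) - 1

# The n-year cumulative return sits in column n-1 (shifts 0 .. n-1),
# mirroring In [10]/In [11] above where column 2 became Return_3y.
for n in (5, 10, 20):
    simple_df[f'Return_{n}y'] = cum.iloc[:, n - 1]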
jezrael
  • hi @jezrael, pls check https://stackoverflow.com/questions/50230233/how-to-reduce-part-of-a-dataframe-colunm-value-based-on-another-column – Pyd May 08 '18 at 09:26
  • @pyd - I was solving another question, so I was too late :( Sorry. – jezrael May 08 '18 at 09:33
  • it's ok, jpp's answer works fine, is there any other pandorable way to do this? – Pyd May 08 '18 at 09:37
  • @pyd - The problem is that it needs a loop over each row, so for now I have no better solution... – jezrael May 08 '18 at 09:38
  • is jpp's solution fine? – Pyd May 08 '18 at 09:41
  • @pyd - I think yes :) – jezrael May 08 '18 at 09:42
  • @jezrael thank you for the answer, it seems to be what I was looking for. However, do you have any clue why there is such a big difference in performance? Aren't they doing the same thing under the hood? – danie May 08 '18 at 09:44
  • @Daniele - I think not, because you create a `DataFrame` for each group with `Series.shift()`. The faster solution uses the vectorized `DataFrameGroupBy.shift`, and the final `concat` is very fast because it joins only the `range(3)` output Series, not one object per group. – jezrael May 08 '18 at 09:46
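
To make that last point concrete, here is a rough, purely illustrative check (the counter is hypothetical, not part of the answer): counting how often the Python-level function runs shows that `apply` pays a cost per group, while the vectorized path issues only `range(3)` shift calls regardless of the number of groups.

import numpy as np
import pandas as pd

np.random.seed(234)
N = 10000
idx = np.random.randint(1990, 2020, size=N)
simple_df = pd.DataFrame({'ID': np.random.randint(1000, size=N),
                          'Return': np.random.rand(N)}, index=idx).sort_values('ID')

calls = {'n': 0}  # hypothetical counter, only to illustrate the per-group overhead

def create_df(x):
    calls['n'] += 1  # one Python-level call per group
    return pd.DataFrame({i: x.Return.shift(-i) for i in range(3)})

simple_df.groupby('ID', sort=False).apply(create_df)
print(calls['n'])  # roughly one call per group (~1000 here), vs. just 3 groupby shifts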