Pandas shift with a DatetimeIndex takes too long to run

Question

I have a run-time problem when shifting a large DataFrame that has a datetime index.

Example using created dummy data:

import numpy as np
import pandas as pd

# freq='M' means month end; pandas >= 2.2 prefers the alias 'ME'
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)

If I shift without using the DatetimeIndex, it takes only about 12 s:

%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s

But when I shift on the DatetimeIndex, which is what I actually need, it takes about 40 minutes:

%%time    
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s

In my case, I need the results of shift(1) through shift(6), merged with the original data on col2 and col3, so I use a for loop and merge. Is there a faster way to do this? Thanks in advance; any response is much appreciated.

Ben's answer solves it:

%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s

It also works for the loop:

%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index() 
Wall time: 1min 44s

Actually, my real data already uses MonthEnd(0), so I just loop over range(1,7). I also apply this to multiple columns, so I use neither astype nor reindex, because I use a left merge instead.

score 2 · Accepted Answer · answered Aug 23 '21 at 15:56

The two operations are slightly different, and their results are not the same, because your data (at least the dummy data here) is not ordered, and especially if some col3 values have missing dates. That said, the time difference seems enormous, so I think you should take a different approach.
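To make the difference concrete, here is a minimal sketch on a made-up three-row Series in which February is missing (the values and dates are illustrative assumptions, not taken from the question's data). A positional shift moves values down by one row, while a shift with freq moves the index by one calendar month end:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical series: 2020-02-29 is deliberately missing.
s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime(["2020-01-31", "2020-03-31", "2020-04-30"]),
)

# Positional shift: each row receives the value of the previous row,
# regardless of how far apart the dates are.
by_position = s.shift(1, fill_value=0)

# Calendar shift: values stay put, the index moves forward one month end.
by_calendar = s.shift(1, freq=MonthEnd(1))

# Align the calendar-shifted values back onto the original dates.
aligned = by_calendar.reindex(s.index, fill_value=0)

print(by_position.tolist())  # [0, 1, 2]
print(aligned.tolist())      # [0, 0, 2]
```

The two results disagree on 2020-03-31: the positional shift reports the January value there, whereas the calendar shift finds no February row and falls back to the fill value.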

One way is to add X MonthEnd offsets to col2 for X from 0 to 6, set col3 and col2 as the index, use add_suffix to keep track of the "shift" value, and concat all of the results. Then fillna and convert the dtype back to the original one. The rest is mostly cosmetic, depending on your needs.

from pandas.tseries.offsets import MonthEnd

res = (
    pd.concat([
        df.assign(col2 = lambda x: x['col2']  + MonthEnd(i))
          .set_index(['col3', 'col2'])
          .add_suffix(f'_{i}')
        for i in range(0,7)], 
        axis=1)
      .fillna(0) 
      # depends on your original data
      .astype(df['col1'].dtype) 
      # if you want a left merge ordered like original df
      #.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
      # if you want col2 and col3 back as columns
      # .reset_index() 
)

Note that concat does an outer join by default, so you end up with months that were not in your original data. Also, col1_0 is actually the original data (here with my random numbers).

print(res.head(10))
                 col1_0  col1_1  col1_2  col1_3  col1_4  col1_5  col1_6
col3 col2                                                              
0    2020-01-31       7       0       0       0       0       0       0
     2020-02-29       8       7       0       0       0       0       0
     2020-03-31       2       8       7       0       0       0       0
     2020-04-30       3       2       8       7       0       0       0
     2020-05-31       4       3       2       8       7       0       0
     2020-06-30      12       4       3       2       8       7       0
     2020-07-31      13      12       4       3       2       8       7
     2020-08-31       0      13      12       4       3       2       8
     2020-09-30       0       0      13      12       4       3       2
     2020-10-31       0       0       0      13      12       4       3
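If you only want the rows of the original data (the left-merge variant mentioned in the code comments above), a minimal sketch could look like the following. The frame `small` and the helper `lag_by_months` are hypothetical names made up for this illustration, and the sketch assumes each (col3, col2) pair is unique, as it is after drop_duplicates in the question:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Tiny made-up frame: group "a" has no row for 2020-02-29.
small = pd.DataFrame({
    "col3": ["a", "a", "a", "b", "b"],
    "col2": pd.to_datetime(["2020-01-31", "2020-03-31", "2020-04-30",
                            "2020-01-31", "2020-02-29"]),
    "col1": [1, 2, 3, 10, 20],
})

def lag_by_months(frame, n):
    """Move each row's date forward n month ends and rename col1 to col1_n."""
    lagged = frame[["col3", "col2", "col1"]].copy()
    lagged["col2"] = lagged["col2"] + MonthEnd(n)
    return lagged.rename(columns={"col1": f"col1_{n}"})

out = small
for n in range(1, 3):
    # A left merge keeps exactly the original rows, in their original order.
    out = out.merge(lag_by_months(small, n), on=["col3", "col2"], how="left")

lag_cols = ["col1_1", "col1_2"]
out[lag_cols] = out[lag_cols].fillna(0).astype(small["col1"].dtype)

print(out["col1_1"].tolist())  # [0, 0, 2, 0, 10]
print(out["col1_2"].tolist())  # [0, 1, 0, 0, 0]
```

Unlike the outer join from concat, this never creates rows for months that were absent from the original data.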

Hi @Ben.T, Do you have any idea for this issue? https://stackoverflow.com/questions/70697366/shift-datetime-many-times-and-left-join-them-all-in-pyspark — zonna, Jan 13 '22 at 13:24

score 1 · Answer 2 · answered Aug 23 '21 at 16:33

This is an issue with groupby + shift. If you specify an axis other than 0 or a frequency, it falls back to a very slow loop over the groups. If neither of those is specified, it can use a much faster path, which is why you see more than an order of magnitude difference in performance.

The relevant code for DataFrameGroupBy.shift is:

def shift(self, periods=1, freq=None, axis=0, fill_value=None):
    """..."""
    if freq is not None or axis != 0:
        return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))

Previously, this slow path was also taken when a fill_value was specified.
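As a rough illustration of what that fallback amounts to, the sketch below shifts each group separately in plain Python. It is not the pandas internals, only an approximation of the per-group work; the frame `toy` is made up for the example:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Made-up data with two groups, indexed by date as in the question.
toy = pd.DataFrame({
    "col3": ["a", "a", "b"],
    "col2": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-01-31"]),
    "col1": [1, 2, 10],
}).set_index("col2")

# One Python-level Series.shift call per group. With on the order of
# 100,000 groups, as in the question, this per-call overhead dominates.
pieces = {
    key: grp["col1"].shift(1, freq=MonthEnd(1))
    for key, grp in toy.groupby("col3")
}
looped = pd.concat(pieces, names=["col3", "col2"])

print(len(pieces))      # 2
print(looped.tolist())  # [1, 2, 10]
```

Each value keeps its group but its date moves forward one month end, which is why the work cannot simply be done as one positional shift over the whole column.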

ALollz


oww, I see. Thanks – zonna Aug 23 '21 at 17:13
Yeah, used to be an issue too with `fill_value` but they have improved the function since `0.24`. Perhaps one day they'll figure this out, but my general sense is that given the irregularity of calendar addition it's not simple/possible. Though perhaps using a proxy like 0 is Dec 2010, 1 is Jan 2011, you'd be able to use a normal `.shift` and then map the values back afterwards. – ALollz Aug 23 '21 at 17:22
I tried encoding `col2`, which contains `datetime` values, with the `scikit-learn` `LabelEncoder` and setting it as the index before applying a normal `.shift` without `freq`, but it still takes too long to run. – zonna Aug 23 '21 at 18:47
