I have a running time issue with shifting a large dataframe with datetime index.
Example using created dummy data:
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)
If I shift not using datetimeindex, it only takes about 12s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use datetimeindex, as that situation that I need, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) until shift(6) and merge them with original data by col2
and col3
. So I use for
looping and merge.
Is there any solution for this? Thanks for your answer, will appreciate so much any respond.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
also implement to the looping:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data is already using MonthEnd(0)
so I just use loop in range(1,7)
. I also implement to multiple columns so I don't use astype
and implement reindex
because I use left merge
.