1

This is best explained through an example.

I have the following dataframe (each row can be thought of as a transaction):

DATE        AMOUNT
2017-01-29  10 
2017-01-30  20
2017-01-31  30
2017-02-01  40
2017-02-02  50
2017-02-03  60

I would like to compute a 2-day rolling sum but only for rows in February.

Code snippet I have currently:

df.set_index('DATE',inplace=True)

res=df.rolling('2d')['AMOUNT'].sum()

which gives:

            AMOUNT 
2017-01-29  10 
2017-01-30  30 
2017-01-31  50 
2017-02-01  70 
2017-02-02  90 
2017-02-03  110

but I really only need the output in the last 3 rows, the operations on the first 3 rows are unnecessary. When the dataframe is huge, this incurs immense time complexity. How do I compute the rolling sum only for the last 3 rows (other than computing the rolling sum for all rows and then doing a row filtering operation after that)?

*I cannot pre-filter the dataframe either because there wouldn't be the 'lookback' period in January for the correct rolling sum value to be obtained.

cs95
  • 379,657
  • 97
  • 704
  • 746
Zi Xiang
  • 21
  • 5
  • What is expected output? There are data with multiple years? – jezrael May 24 '18 at 07:00
  • @jezrael I'm dealing with a dataset of a few million transactions over 2 years but I only require rolling sum values computed for the second year. The rolling window is 1 year. Thus when I apply the rolling function to all rows, there's 'wasted' computation for a year worth of transactions. – Zi Xiang May 24 '18 at 07:18

2 Answers2

0

You can use timedelta to filter your df and keep the last day of January.

import datetime  

dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
df.loc[dateStart:dateEnd]

Then you can do your rolling operation and drop the first line (which is 2017-01-31)

Charles R
  • 1,621
  • 1
  • 8
  • 25
  • I did consider this, but the problem is that when my rolling window increases to say 1 year, I'd have to keep an extra year of transactions. Ideally I want to compute the rolling sum _only_ for the February transactions. – Zi Xiang May 24 '18 at 07:29
  • An other way is to create a column with the value of days -1 : df['AMOUT1'] = df.Amout.shift(1) and then filter the df and do df.AMOUT = df.AMOUT + df.AMOUT1 – Charles R May 24 '18 at 08:07
  • df['AMOUNT1'] = df.AMOUNT.shift(1) assumes the rolling window is a fixed number of rows. Not sure how the shifting should be implemented with a time-based window (unless I resample the dataframe such that every row corresponds to 1 day). Does get me what I want but the additional operations could be more time eating than saving. – Zi Xiang May 24 '18 at 19:06
0

you can just compute the rolling sum only for the last rows by using tail(4)

res = df.tail(4).rolling('2d')['AMOUNT'].sum()

Output:

DATE
2017-01-31      NaN
2017-02-01     70.0
2017-02-02     90.0
2017-02-03    110.0
Name: AMOUNT, dtype: float64

If you want to merge those values - excluding 2017-01-31 then you can do the following:

df.loc[res.index[1:]] = res.tail(3)

Output:

            AMOUNT
DATE    
2017-01-29  10.0
2017-01-30  20.0
2017-01-31  30.0
2017-02-01  70.0
2017-02-02  90.0
2017-02-03  110.0
gyx-hh
  • 1,421
  • 1
  • 10
  • 15
  • That gives me 40 for the 2017-02-01 row, which is incorrect (should be 70). – Zi Xiang May 24 '18 at 19:00
  • Hmm ok you're right.. but I think you just needed to do tail(4)? But I'm on my phone now so I didn't test it. – gyx-hh May 24 '18 at 19:25
  • @ZiXiang doing `tail(4)` does give the data you wanted. I have also edited the answer to show you how to add `res` to the main df if this is desired. – gyx-hh May 25 '18 at 10:12