
I'm trying to use the rolling() function on a pandas DataFrame with monthly data. However, I dropped some NaN values, so there are now gaps in my time series. As a result, the basic window parameter gives a misleading answer, since it simply looks at the previous observation regardless of its date:

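import pandas as pd
import numpy as np
import random

# `dt` (the list of month-end dates used as the index) is not shown in this snippet
dft = pd.DataFrame(np.random.randint(0, 10, size=len(dt)), index=dt)
dft.columns = ['value']
# Treat values below 3 as missing, then drop them to create gaps
dft['value'] = np.where(dft['value'] < 3, np.nan, dft['value'])
dft = dft.dropna()
# Sum over a window of two rows, regardless of their dates
dft['basic'] = dft['value'].rolling(2).sum()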

See, for example, the 2017-08-31 entry: it sums 3.0 and 9.0, even though the previous entry is 2017-03-31, five months earlier.

In [57]: dft.tail()
Out[57]:
            value  basic
2017-02-28    8.0   12.0
2017-03-31    3.0   11.0
2017-08-31    9.0   12.0
2017-10-31    7.0   16.0
2017-11-30    7.0   14.0

The natural solution (I thought) is to use a '2M' offset, but it gives an error:

In [58]: dft['basic2M'] = dft['value'].rolling('2M').sum()
...<output omitted>...
ValueError: <2 * MonthEnds> is a non-fixed frequency

If I switch to a daily offset, I can get it to work, but this seems like an odd workaround:

In [59]: dft['basic32D'] = dft['value'].rolling('32D', min_periods=2).sum()

In [61]: dft.tail()
Out[61]:
            value  basic  basic32D
2017-02-28    8.0   12.0      12.0
2017-03-31    3.0   11.0      11.0
2017-08-31    9.0   12.0       NaN
2017-10-31    7.0   16.0       NaN
2017-11-30    7.0   14.0      14.0

I also tried converting to a PeriodIndex:

dfp = dft.to_period(freq='M')

but this gives the same error:

dfp['basic2M'] = dfp['value'].rolling('2M').sum()

and this is very unexpected:

dfp['basic32Dp'] = dfp['value'].rolling('32D', min_periods=2).sum()
In [68]: dfp
Out[68]:
         value  basic  basic32D  basic32Dp
2016-02    9.0    NaN       NaN        NaN
2016-03    3.0   12.0      12.0       12.0
2016-04    7.0   10.0      10.0       19.0
2016-05    3.0   10.0      10.0       22.0
2016-06    4.0    7.0       7.0       26.0
2016-07    7.0   11.0      11.0       33.0
2016-08    3.0   10.0      10.0       36.0
2016-09    9.0   12.0      12.0       45.0
2016-11    5.0   14.0       NaN       50.0
2017-01    4.0    9.0       NaN       54.0
2017-02    8.0   12.0      12.0       62.0
2017-03    3.0   11.0      11.0       65.0
2017-08    9.0   12.0       NaN       74.0
2017-10    7.0   16.0       NaN       81.0
2017-11    7.0   14.0      14.0       88.0

The '32D' offset with the 'M' period index seems to be treated as '32M' perhaps? It appears to just be an expanding sum for the entire series.

Perhaps I'm misunderstanding how to use offsets? Obviously, I could solve this by keeping the NaN values in the original value column and just using the window parameter, but offsets seem quite useful.
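The NaN-keeping approach mentioned above can be sketched as follows. This is only a minimal sketch, not part of the original post: the five sample rows are copied from the tail shown earlier, and the names `full_index`, `full`, `sum2m`, and `result` are made up for illustration. It restores the missing months as NaN rows so that an integer window of 2 always spans exactly two calendar months.

```python
import pandas as pd

# Sample month-end rows copied from the tail of the output shown earlier.
idx = pd.to_datetime(
    ["2017-02-28", "2017-03-31", "2017-08-31", "2017-10-31", "2017-11-30"]
)
dft = pd.DataFrame({"value": [8.0, 3.0, 9.0, 7.0, 7.0]}, index=idx)

# Rebuild the complete month-end calendar so missing months come back as NaN rows.
# (full_index, full, sum2m and result are hypothetical names for this sketch.)
full_index = pd.date_range(idx.min(), idx.max(), freq=pd.offsets.MonthEnd())
full = dft.reindex(full_index)

# A 2-row window now means "this month and the previous calendar month".
# min_periods=2 yields NaN whenever either of the two months is missing.
full["sum2m"] = full["value"].rolling(2, min_periods=2).sum()

# Keep only the months that had data originally.
result = full.loc[dft.index]
print(result)
```

With this sample, 2017-03-31 and 2017-11-30 get a sum because the preceding month is present, while the rows that follow a gap come out as NaN.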

For what it's worth, if I generate hourly data with a DatetimeIndex, things seem to work as expected (i.e. a '2D' offset with data every 12 hours gives the correct answer across missing rows).
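A small sketch of that fixed-frequency case, assuming made-up data (eight 12-hourly points with the two rows for 2018-01-02 dropped). The names `s`, `time_based`, and `count_based` are hypothetical. It compares a '2D' time window with a 4-row window, which would cover the same span if no rows were missing.

```python
import pandas as pd

# Hypothetical 12-hourly series with values 1..8.
idx = pd.date_range("2018-01-01", periods=8, freq=pd.Timedelta(hours=12))
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], index=idx)

# Drop both observations on 2018-01-02 to create a gap.
s = s.drop(s.index[2:4])

# Time-based window: only rows within the last 2 days of each timestamp count.
time_based = s.rolling("2D").sum()

# Count-based window: always the last 4 rows, however old they are.
count_based = s.rolling(4).sum()

print(pd.DataFrame({"value": s, "2D": time_based, "4 rows": count_based}))
```

At 2018-01-03 12:00 the time-based sum only includes the two observations from 2018-01-03, whereas the 4-row window also reaches back across the gap to 2018-01-01.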

cs95
Jesse Blocher
  • The problem with a variable sized window is that there is no one obviously superior way to apply it when the window spans two months with different lengths. – Mad Physicist Jun 05 '18 at 14:57
  • I understand. That is why I tried the PeriodIndex - that seems much simpler to me because it is just a basic counter behind it - 12 months per year. – Jesse Blocher Jun 05 '18 at 15:00
  • The initial code block that you have given results in an error "NameError: name 'dt' is not defined" – Sid Kwakkel Feb 05 '21 at 01:01

2 Answers

Here is a function that gives you the rolling sum of a specified number of months. You did not provide variable 'dt' in your code above so I just created a list of datetimes (code included).

from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
import numpy as np
import random

def date_range(start_date, end_date, increment, period):
    result = []
    nxt = start_date
    delta = relativedelta(**{period:increment})
    while nxt <= end_date:
        result.append(nxt)
        nxt += delta
    return result

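def MonthRollSum(df, offset, sumColumn):
    # df must have a DatetimeIndex
    df2 = df.copy()
    # shift every timestamp back by `offset` days
    df2.index = df2.index + pd.DateOffset(days=-offset)
    # total sumColumn within each (year, month) of the shifted dates
    return df2.groupby([df2.index.year, df2.index.month])[sumColumn].sum()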

# added this part to generate the dt list at 8-hour intervals over 1000 days
start_date = datetime.now()
end_date = start_date + relativedelta(days=1000)
end_date = end_date.replace(hour=19, minute=0, second=0, microsecond=0)
dt = date_range(start_date, end_date, 8, 'hours')

# the following was given by the questioner
dft = pd.DataFrame(np.random.randint(0,10,size=len(dt)),index=dt)
dft.columns = ['value']
dft['value'] = np.where(dft['value'] < 3,np.nan,dft['value'])
dft = dft.dropna()

# Call the solution function
dft = MonthRollSum(dft, 2, 'value')
dft

The results may vary because the initial list of values is randomly generated:

2021  2     290.0
      3     379.0
      4     414.0
      5     368.0
      6     325.0
      7     405.0
      8     425.0
      9     380.0
      10    393.0
      11    370.0
      12    419.0
2022  1     377.0
      2     275.0
      3     334.0
      4     350.0
      5     395.0
      6     376.0
      7     420.0
      8     419.0
      9     359.0
      10    328.0
      11    394.0
      12    345.0
2023  1     381.0
      2     335.0
      3     352.0
      4     355.0
      5     376.0
      6     350.0
      7     401.0
      8     443.0
      9     394.0
      10    394.0
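To make it easier to see what the function computes, here is a small deterministic check. The four dates and values are invented for illustration and `out` is a hypothetical name. As written, `offset` is passed to `pd.DateOffset(days=...)`, so it shifts timestamps by days, and the result holds one total per calendar month of the shifted dates.

```python
import pandas as pd

def MonthRollSum(df, offset, sumColumn):
    # Same logic as the function above: shift dates back by `offset` days,
    # then total sumColumn per (year, month) of the shifted dates.
    df2 = df.copy()
    df2.index = df2.index + pd.DateOffset(days=-offset)
    return df2.groupby([df2.index.year, df2.index.month])[sumColumn].sum()

# Invented sample data, chosen so the 2-day shift moves rows across month ends.
idx = pd.to_datetime(["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-15"])
df = pd.DataFrame({"value": [1.0, 2.0, 4.0, 8.0]}, index=idx)

out = MonthRollSum(df, 2, "value")
print(out)
```

Here 2021-01-01 shifts into December 2020 and 2021-02-01 shifts into January 2021, so the three groups total 1.0, 6.0, and 8.0. Whether this matches the two-month rolling sum asked for in the question depends on how the monthly totals are meant to be interpreted.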
Sid Kwakkel

This worked for me, using '30D' instead of '1M':

df_px = df_px.set_index(pd.to_datetime(df_px['date']))
df_px['px_avg30d']=df_px.groupby('stock')['px'].transform(lambda x: x.rolling('30D').mean())
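A self-contained version of the same idea, as a sketch with made-up prices. The column names `date`, `stock`, and `px` follow the snippet above, while the data itself is invented. Note that a 30-day window is a fixed length, so it only approximates a calendar month.

```python
import pandas as pd

# Invented sample prices for two stocks; dates are sorted within each stock.
df_px = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-15", "2021-03-01", "2021-01-02", "2021-01-20"],
        "stock": ["A", "A", "A", "B", "B"],
        "px": [10.0, 20.0, 30.0, 100.0, 200.0],
    }
)

# Time-based rolling windows need a datetime index.
df_px = df_px.set_index(pd.to_datetime(df_px["date"]))

# Per stock, average the prices observed within the trailing 30 days.
df_px["px_avg30d"] = df_px.groupby("stock")["px"].transform(
    lambda x: x.rolling("30D").mean()
)
print(df_px[["stock", "px", "px_avg30d"]])
```

For stock A, the 2021-03-01 row is more than 30 days after the earlier ones, so its average is just its own price.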
citynorman