I am trying to reset a rolling average based on a signal condition. From the point where the condition is true onwards, the previous values should be ignored. This can be done by replacing the history of prior values by the current value at the signal.

import numpy as np
import pandas as pd

idx = pd.date_range(start='2000-01-01', end='2000-01-11')
#df = pd.DataFrame(np.random.normal(size=(len(idx),1)).cumsum(), index=idx, columns=['value'])
df = pd.DataFrame(range(0, 11), index=idx, columns=['value'])
df['signal'] = [0] * 5 + [1] + [0] * 5              # signal fires on 2000-01-06
df['ma'] = df.value.rolling(3).mean()
df['value2'] = [5] * 6 + list(range(6, 11))         # history replaced by the value at the signal
df['ma2'] = df.value2.rolling(3).mean()

            value  signal   ma  value2       ma2
2000-01-01      0       0  NaN       5       NaN
2000-01-02      1       0  NaN       5       NaN
2000-01-03      2       0  1.0       5  5.000000
2000-01-04      3       0  2.0       5  5.000000
2000-01-05      4       0  3.0       5  5.000000
2000-01-06      5       1  4.0       5  5.000000
2000-01-07      6       0  5.0       6  5.333333
2000-01-08      7       0  6.0       7  6.000000
2000-01-09      8       0  7.0       8  7.000000
2000-01-10      9       0  8.0       9  8.000000
2000-01-11     10       0  9.0      10  9.000000

In this example, value2 is the series the moving average should be calculated on once the signal fires, and ma2 is the expected result from 2000-01-06 onward. Up to 2000-01-05 the original 'ma' values should be kept; values predating the signal should not be affected.
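To make the target behaviour concrete, the padding described above can be written out without building value2 by hand. This is only a sketch of the expected result, not a tested general solution: the helper name reset_rolling_mean and the column name ma_reset are made up for illustration, and it assumes a signal is any row where signal equals 1 (zeros and NaN are both treated as "no signal").

```python
import pandas as pd

def reset_rolling_mean(value, signal, window):
    """Rolling mean that restarts at every row where signal == 1.

    Hypothetical helper: window slots that would reach back past the most
    recent signal are filled with the value at that signal row.
    """
    fired = signal.eq(1)                      # zeros and NaN both count as "no signal"
    segment = fired.cumsum()                  # 0 before the first signal, then 1, 2, ...
    anchor = value.where(fired).ffill()       # value at the most recent signal row
    # Sum of the in-segment values that fit in the window (partial windows allowed).
    part_sum = value.groupby(segment).transform(
        lambda s: s.rolling(window, min_periods=1).sum()
    )
    # Number of in-segment values inside the window, capped at the window size.
    n_inside = (value.groupby(segment).cumcount() + 1).clip(upper=window)
    padded = (part_sum + (window - n_inside) * anchor) / window
    plain = value.rolling(window).mean()      # ordinary average before the first signal
    return padded.where(segment > 0, plain)

idx = pd.date_range(start='2000-01-01', end='2000-01-11')
df = pd.DataFrame({'value': list(range(11)),
                   'signal': [0] * 5 + [1] + [0] * 5}, index=idx)
df['ma_reset'] = reset_rolling_mean(df['value'], df['signal'], 3)
```

On the sample data, ma_reset should equal 'ma' up to 2000-01-05 and 'ma2' from 2000-01-06 onward.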

I found a similar question, "pandas rolling average with a rolling mask / excluding entries", that may be useful, but I can't figure out how to apply it to my case.

Boaz

2 Answers

  • Effectively, you want to apply() a rolling mean within each groupby() group.
  • groupby() uses the cumsum() of signal to generate the groups (two in your example). This assumes a signal of zero is significant.
  • There is a difference between your incomplete sample output and the way this code works: rolling(4) generates 3 NaN at the start of each rolling mean calculation.
df.loc[:,["ma","ma2"]] = df.groupby(df["signal"].cumsum()).apply(lambda d: d.loc[:,["value","value2"]].rolling(4).mean()).values
df
                     value  signal   ma  value2   ma2
2000-01-01 00:00:00      0       0  nan       0   nan
2000-01-02 00:00:00      1       0  nan       5   nan
2000-01-03 00:00:00      2       0  nan       5   nan
2000-01-04 00:00:00      3       0  1.5       5  3.75
2000-01-05 00:00:00      4       0  2.5       5     5
2000-01-06 00:00:00      5       1  nan       5   nan
2000-01-07 00:00:00      6       0  nan       6   nan
2000-01-08 00:00:00      7       0  nan       7   nan
2000-01-09 00:00:00      8       0  6.5       8   6.5
2000-01-10 00:00:00      9       0  7.5       9   7.5
2000-01-11 00:00:00     10       0  8.5      10   8.5
Rob Raymond
  • Thanks, that was quick! Unfortunately your proposal doesn't fulfill the need. NaNs are not allowed; we need a valid value immediately. That is why we backfill from the signal backwards with the current value. The goal is to restart the ma at the current value. A signal of zero is not significant; it might as well be NaN. – Boaz Aug 01 '21 at 16:16
Rob's answer using df.groupby(df["signal"].cumsum()) is nice. If your problem is the remaining NaNs, that's easily fixed once you understand the parameters of the pandas rolling() method; see its documentation.

It sounds like you want an unlimited window size that still produces a value with only one input. So you just need to replace your rolling(4) with:

rolling(df.shape[0], min_periods=1)

where the df.shape[0] is the number of entries in your dataframe, to make sure the rolling window can be as large as the size of the dataframe.
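Put together with the groupby from Rob's answer, the suggestion might look like the sketch below. The use of transform() instead of apply() and the column name ma_reset are my own choices for illustration, not something from the original answers.

```python
import pandas as pd

idx = pd.date_range(start='2000-01-01', end='2000-01-11')
df = pd.DataFrame({'value': list(range(11)),
                   'signal': [0] * 5 + [1] + [0] * 5}, index=idx)

# Each signal starts a new group: 0 before the first signal, then 1, 2, ...
groups = df['signal'].cumsum()

# A window as long as the frame with min_periods=1 never drops old rows
# inside a group and yields a value from the first row of each group.
df['ma_reset'] = df['value'].groupby(groups).transform(
    lambda s: s.rolling(df.shape[0], min_periods=1).mean()
)
```

Note that with a window this large the result is effectively an expanding mean within each group, so it contains no NaNs but is no longer a fixed 3-row average, and the rows before the signal are recomputed as well.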

(Posting as an answer because I don't have enough reputation to comment...)