
Is there a vectorized operation to calculate the cumulative and rolling standard deviation (SD) of a pandas DataFrame column?

For example, I want to add a column 'c' that holds the cumulative SD of column 'a': at index 0 it shows NaN because there is only 1 data point, at index 1 it is the SD of the first 2 data points, and so on.

The same question applies to rolling SD. Is there an efficient way to calculate it without iterating through df.itertuples()?

import numpy as np
import pandas as pd

def main():
    np.random.seed(123)
    df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
    print(df)

if __name__ == '__main__':
    main()
Scott Boston
Roy

2 Answers


For cumulative SD based on column 'a', let's use rolling with a window size equal to the length of the dataframe and min_periods=2:

df['c'] = df['a'].rolling(len(df), min_periods=2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.691916
3 -2.426679 -0.428913  1.133892
4  1.265936 -0.866740  1.395750
5 -0.678886 -0.094709  1.250335
6  1.491390 -0.638902  1.374933
7 -0.443982 -0.434351  1.274843
8  2.205930  2.186786  1.450563
9  1.004054  0.386186  1.403721
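
As a sanity check, here is a minimal sketch that compares the vectorized result with a slow, explicit loop over prefixes of column 'a'. It assumes the same seeded DataFrame as in the question; the names `cumulative` and `reference` are only for illustration. pandas' `std()` uses the sample SD (ddof=1) by default, which is why a single data point yields NaN.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Vectorized cumulative SD: a window as long as the frame, at least 2 points.
cumulative = df['a'].rolling(len(df), min_periods=2).std()

# Slow reference: sample SD (ddof=1) of the first i + 1 values of 'a'.
# Series.std() of a single value is NaN, matching min_periods=2.
reference = pd.Series(
    [df['a'].iloc[:i + 1].std() for i in range(len(df))],
    index=df.index,
)

# Both approaches should agree up to floating-point tolerance.
assert np.allclose(cumulative, reference, equal_nan=True)
```

The loop is only there for verification; the rolling call is the one to use in practice.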

And for rolling SD based on two values at a time:

df['c'] = df['a'].rolling(2).std()

Output:

          a         b         c
0 -1.085631  0.997345       NaN
1  0.282978 -1.506295  0.967753
2 -0.578600  1.651437  0.609228
3 -2.426679 -0.428913  1.306789
4  1.265936 -0.866740  2.611073
5 -0.678886 -0.094709  1.375197
6  1.491390 -0.638902  1.534617
7 -0.443982 -0.434351  1.368514
8  2.205930  2.186786  1.873771
9  1.004054  0.386186  0.849855
Scott Boston

I think, if by rolling you mean cumulative, then the right term in Pandas is expanding:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding

It also accepts a min_periods argument.

df['c'] = df['a'].expanding(2).std()
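
A quick sketch to confirm that this gives the same numbers as the full-length rolling window from the other answer, assuming the seeded DataFrame from the question (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])

# Expanding window: grows from the first row up to the current row.
via_expanding = df['a'].expanding(min_periods=2).std()

# Rolling window as long as the whole frame behaves the same way here.
via_rolling = df['a'].rolling(len(df), min_periods=2).std()

# The two should match up to floating-point tolerance, NaN included.
assert np.allclose(via_expanding, via_rolling, equal_nan=True)
```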

The case for rolling was handled by Scott Boston, and it is unsurprisingly called rolling in Pandas.

The advantage of expanding over rolling(len(df), ...) is that you don't need to know the length in advance. This is very useful, e.g., with groupby, where each group can have a different length.
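
To illustrate the groupby case, here is a minimal sketch. The 'group' column and its labels are hypothetical and not part of the original question; they are added only to have something to group by.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10, 2), columns=['a', 'b'])
df['group'] = ['x', 'y'] * 5  # hypothetical grouping column

# Cumulative SD of 'a' computed separately within each group.
# The result is indexed by (group, original index), so drop the
# group level to align it with df before assigning.
df['c'] = (
    df.groupby('group')['a']
      .expanding(min_periods=2)
      .std()
      .reset_index(level=0, drop=True)
)

print(df)
```

No window length is passed anywhere, so the same call works regardless of how many rows each group has.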

Tomasz Gandor