1

I would like to create a column of the average of the 5 highest values from a rolling window of 30. Using a for loop is very slow with a large DataFrame. I tried using rolling() with nlargest(), but it wouldn't work. Any suggestions for speeding this up?

def top_values(df, column, days):
    top5 = df.nlargest(days, column)
    top = top5[column].sum() / days

x = 0
w = 0
for i in df.index:
    if x > 30:
        df['tops'][x] = top_values(df[w:x], 'column', 5)
        w += 1
        x += 1
cs95
  • 379,657
  • 97
  • 704
  • 746
slard
  • 61
  • 1
  • 8

1 Answers1

4

One method would be to use a lambda function in your rolling, and for instance get the mean of the first 5 elements of a sorted list:

df['column'].rolling(30).apply(lambda x: np.mean(sorted(x,reverse=True)[:5]))

Minimal Example:

On a dataframe of 15 elements, we can get the mean value of the top 3 values in a window of 5 to demonstrate:

>>> df
    column
0       48
1        9
2       36
3       71
4       59
5       16
6        9
7       18
8       43
9        3
10      54
11      23
12      12
13      38
14      54

>>> df['column'].rolling(5).apply(lambda x: np.mean(sorted(x,reverse=True)[:3]))
0           NaN
1           NaN
2           NaN
3           NaN
4     59.333333
5     55.333333
6     55.333333
7     49.333333
8     40.000000
9     25.666667
10    38.333333
11    40.000000
12    40.000000
13    38.333333
14    48.666667
Name: column, dtype: float64
sacuL
  • 49,704
  • 8
  • 81
  • 106