I want to use a window which calculates the mean of the last 5 results before the current result.
For example, if I have a dataframe with results, the mean_last_5
would be as follows:
   Result  Mean_last_5
1       4          NaN
2       2          NaN
3       6          NaN
4       3          NaN
5       2          NaN
6       6          3.4
7       3          3.8
The 6th row would be calculated as (4+2+6+3+2)/5 = 3.4, and the 7th row as (2+6+3+2+6)/5 = 3.8.
So in pandas terms, I would use a rolling window of 5 with a shift of 1, roughly as sketched below.
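For reference, a minimal pandas sketch of what I mean (assuming the column is simply called result and using the sample data from the table above):

import pandas as pd

# sample data from the table above
df = pd.DataFrame({'result': [4, 2, 6, 3, 2, 6, 3]})

# shift(1) excludes the current row, rolling(5).mean() averages the previous 5 results;
# the first 5 rows stay NaN because fewer than 5 prior results exist
df['mean_last_5'] = df['result'].shift(1).rolling(5).mean()
print(df)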
With PySpark I just can't figure out how to do this. Current code:
from pyspark.sql import Window
import pyspark.sql.functions as sf

def mean_last_5(df):
    window = Window.partitionBy('Id').orderBy('year').rangeBetween(Window.currentRow - 5, Window.currentRow)
    return df.withColumn('mean_last_5', sf.avg('result').over(window))
Error:
cannot resolve due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions: