1

I want to use a window which calculates the mean of the last 5 results before the current result.

For example, if I have a dataframe with results, the mean_last_5 would be as follows:

   Result    Mean_last_5
1. 4         NaN
2. 2         NaN
3. 6         NaN
4. 3         NaN
5. 2         NaN
6. 6         3.4
7. 3         3.8

The 6th row would be calculated as: (4+2+6+3+2)/5 = 3.4.

So in pandas terms, I would a rolling window of 5 with an shift of 1.

With PySpark I just can't figure out how to do this. Current code:

def mean_last_5(df):
    window = Window.partitionBy('Id').orderBy('year').rangeBetween(Window.currentRow-5, Window.currentRow)

    return df.withColumn('mean_last_5', sf.avg('result').over(window))

Error:

cannot resolve due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions:

Shaido
  • 27,497
  • 23
  • 70
  • 73
Niels Hoogeveen
  • 365
  • 4
  • 19

2 Answers2

2

Since you don't care about the values in the rows, only the offset to the currently row, you should use rowsBetween instead of rangeBetween. The difference is illustrated here: What is the difference between rowsBetween and rangeBetween?.

Note that for rowsBetween both values are inclusive, so to get the rolling mean for the previous 5 rows (not including the current one) you want to define the window as:

window = Window.partitionBy('Id').orderBy('year').rowsBetween(Window.currentRow-5, Window.currentRow-1)

or similarly:

window = Window.partitionBy('Id').orderBy('year').rowsBetween(-5, -1)

Then use it as before:

df.withColumn('mean_last_5', avg('result').over(window))
Shaido
  • 27,497
  • 23
  • 70
  • 73
0

Try this:

def mean_last_5(df):
    window = Window.partitionBy('Id').orderBy('year').rangeBetween(-1, -6)

    return df.withColumn('mean_last_5', sf.avg('result').over(window))

I've put 0 because I want the current row and -5 because I want the 5 previous including the actual result.

Manrique
  • 2,083
  • 3
  • 15
  • 38
  • Note that from the question example the current row should *not* be used in the calculation. – Shaido May 28 '19 at 01:54