I have the following data:
columns = ['aircraft_id', 'Liftoff', 'timestamp', 'value']
l =[
( '0003177d',1550000476500,1550000467000, -80.15625),
( '0003177d',1550000476500,1550000467500, -80.15625),
( '0003177d',1550000476500,1550000468000, -80.15625),
( '0003177d',1550000476500,1550000468500, -80.15625),
( '0003177d',1550000476500,1550000469000,-79.8046875),
( '0003177d',1550000476500,1550000469500,-79.8046875),
( '0003177d',1550000476500,1550000470000,-79.8046875),
( '0003177d',1550000476500,1550000470500,-79.8046875),
( '0003177d',1550000476500,1550000471000,-79.8046875),
( '0003177d',1550000476500,1550000471500,-79.8046875),
( '0003177d',1550000476500,1550000472000, -80.15625),
( '0003177d',1550000476500,1550000472500,-80.5078125),
( '0003177d',1550000476500,1550000473000, -80.859375),
( '0003177d',1550000476500,1550000473500, -80.859375),
( '0003177d',1550000476500,1550000474000, -80.859375),
( '0003177d',1550000476500,1550000474500, -80.859375),
( '0003177d',1550000476500,1550000475000, -80.859375),
( '0003177d',1550000476500,1550000475500, -80.859375),
( '0003177d',1550000476500,1550000476000, -80.859375),
( '0003177d',1550000476500,1550000476500,-80.5078125)]
df=spark.createDataFrame(l, columns)
df.show()
+-----------+-------------+-------------+-----------+
|aircraft_id| Liftoff| timestamp| value|
+-----------+-------------+-------------+-----------+
| 0003177d|1550000476500|1550000467000| -80.15625|
| 0003177d|1550000476500|1550000467500| -80.15625|
| 0003177d|1550000476500|1550000468000| -80.15625|
| 0003177d|1550000476500|1550000468500| -80.15625|
| 0003177d|1550000476500|1550000469000|-79.8046875|
| 0003177d|1550000476500|1550000469500|-79.8046875|
| 0003177d|1550000476500|1550000470000|-79.8046875|
| 0003177d|1550000476500|1550000470500|-79.8046875|
| 0003177d|1550000476500|1550000471000|-79.8046875|
| 0003177d|1550000476500|1550000471500|-79.8046875|
| 0003177d|1550000476500|1550000472000| -80.15625|
| 0003177d|1550000476500|1550000472500|-80.5078125|
| 0003177d|1550000476500|1550000473000| -80.859375|
| 0003177d|1550000476500|1550000473500| -80.859375|
| 0003177d|1550000476500|1550000474000| -80.859375|
| 0003177d|1550000476500|1550000474500| -80.859375|
| 0003177d|1550000476500|1550000475000| -80.859375|
| 0003177d|1550000476500|1550000475500| -80.859375|
| 0003177d|1550000476500|1550000476000| -80.859375|
| 0003177d|1550000476500|1550000476500|-80.5078125|
+-----------+-------------+-------------+-----------+
I want to calculate the mean of the value inside a window where the range between of the window depends on the current value of the timestamp to the timestamp of the Liftoff. Each aircraft has different values of Liftoff.
I try:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('val', F.mean('value').over(Window.partitionBy('aircraft_id','ini_TO','Liftoff').orderBy('timestamp').rangeBetween(df['timestamp'], df['Liftoff']))
But it doesn't work, is there a solution?
Expected result:
- For the first row the range of the window is from 1550000467000 to 1550000476500, so the mean is the sum of the 20 values and divided by 20 (-80,33203).
- For the second row the range of the window is from 1550000467500 to 1550000476500, so the mean is the sum of the 19 values and divided by 19 (-80,34128|).
- Etc...
+-----------+-------------+-------------+---------+---------+
|aircraft_id| Liftoff| timestamp| value| val|
+-----------+-------------+-------------+---------+---------+
| 0003177d|1550000476500|1550000467000|-80,15625|-80,33203|
| 0003177d|1550000476500|1550000467500|-80,15625|-80,34128|
| 0003177d|1550000476500|1550000468000|-80,15625|-80,35156|
| 0003177d|1550000476500|1550000468500|-80,15625|-80,36305|
| 0003177d|1550000476500|1550000469000|-79,80469|-80,37598|
| 0003177d|1550000476500|1550000469500|-79,80469|-80,41406|
| 0003177d|1550000476500|1550000470000|-79,80469|-80,45759|
| 0003177d|1550000476500|1550000470500|-79,80469|-80,50781|
| 0003177d|1550000476500|1550000471000|-79,80469|-80,56641|
| 0003177d|1550000476500|1550000471500|-79,80469|-80,63565|
| 0003177d|1550000476500|1550000472000|-80,15625|-80,71875|
| 0003177d|1550000476500|1550000472500|-80,50781|-80,78125|
| 0003177d|1550000476500|1550000473000|-80,85938|-80,81543|
| 0003177d|1550000476500|1550000473500|-80,85938|-80,80915|
| 0003177d|1550000476500|1550000474000|-80,85938|-80,80078|
| 0003177d|1550000476500|1550000474500|-80,85938|-80,78906|
| 0003177d|1550000476500|1550000475000|-80,85938|-80,77148|
| 0003177d|1550000476500|1550000475500|-80,85938|-80,74219|
| 0003177d|1550000476500|1550000476000|-80,85938|-80,68359|
| 0003177d|1550000476500|1550000476500|-80,50781|-80,50781|
+-----------+-------------+-------------+---------+---------+