
I have the following data:

columns = ['aircraft_id',      'Liftoff',    'timestamp',      'value']

l =[
(   '0003177d',1550000476500,1550000467000,  -80.15625),
(   '0003177d',1550000476500,1550000467500,  -80.15625),
(   '0003177d',1550000476500,1550000468000,  -80.15625),
(   '0003177d',1550000476500,1550000468500,  -80.15625),
(   '0003177d',1550000476500,1550000469000,-79.8046875),
(   '0003177d',1550000476500,1550000469500,-79.8046875),
(   '0003177d',1550000476500,1550000470000,-79.8046875),
(   '0003177d',1550000476500,1550000470500,-79.8046875),
(   '0003177d',1550000476500,1550000471000,-79.8046875),
(   '0003177d',1550000476500,1550000471500,-79.8046875),
(   '0003177d',1550000476500,1550000472000,  -80.15625),
(   '0003177d',1550000476500,1550000472500,-80.5078125),
(   '0003177d',1550000476500,1550000473000, -80.859375),
(   '0003177d',1550000476500,1550000473500, -80.859375),
(   '0003177d',1550000476500,1550000474000, -80.859375),
(   '0003177d',1550000476500,1550000474500, -80.859375),
(   '0003177d',1550000476500,1550000475000, -80.859375),
(   '0003177d',1550000476500,1550000475500, -80.859375),
(   '0003177d',1550000476500,1550000476000, -80.859375),
(   '0003177d',1550000476500,1550000476500,-80.5078125)]

df=spark.createDataFrame(l, columns)

df.show()
+-----------+-------------+-------------+-----------+
|aircraft_id|      Liftoff|    timestamp|      value|
+-----------+-------------+-------------+-----------+
|   0003177d|1550000476500|1550000467000|  -80.15625|
|   0003177d|1550000476500|1550000467500|  -80.15625|
|   0003177d|1550000476500|1550000468000|  -80.15625|
|   0003177d|1550000476500|1550000468500|  -80.15625|
|   0003177d|1550000476500|1550000469000|-79.8046875|
|   0003177d|1550000476500|1550000469500|-79.8046875|
|   0003177d|1550000476500|1550000470000|-79.8046875|
|   0003177d|1550000476500|1550000470500|-79.8046875|
|   0003177d|1550000476500|1550000471000|-79.8046875|
|   0003177d|1550000476500|1550000471500|-79.8046875|
|   0003177d|1550000476500|1550000472000|  -80.15625|
|   0003177d|1550000476500|1550000472500|-80.5078125|
|   0003177d|1550000476500|1550000473000| -80.859375|
|   0003177d|1550000476500|1550000473500| -80.859375|
|   0003177d|1550000476500|1550000474000| -80.859375|
|   0003177d|1550000476500|1550000474500| -80.859375|
|   0003177d|1550000476500|1550000475000| -80.859375|
|   0003177d|1550000476500|1550000475500| -80.859375|
|   0003177d|1550000476500|1550000476000| -80.859375|
|   0003177d|1550000476500|1550000476500|-80.5078125|
+-----------+-------------+-------------+-----------+

I want to calculate the mean of value over a window whose range runs from the current row's timestamp up to that aircraft's Liftoff timestamp. Each aircraft has a different Liftoff value.

I tried:

from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('val', F.mean('value').over(Window.partitionBy('aircraft_id','ini_TO','Liftoff').orderBy('timestamp').rangeBetween(df['timestamp'], df['Liftoff'])))

But it doesn't work. Is there a solution?

Expected result:

  • For the first row, the window ranges from 1550000467000 to 1550000476500, so the mean is the sum of all 20 values divided by 20 (-80.33203).
  • For the second row, the window ranges from 1550000467500 to 1550000476500, so the mean is the sum of the last 19 values divided by 19 (-80.34128).
  • Etc. (A quick arithmetic check of the first two means follows the table below.)
+-----------+-------------+-------------+---------+---------+
|aircraft_id|      Liftoff|    timestamp|    value|      val|
+-----------+-------------+-------------+---------+---------+
|   0003177d|1550000476500|1550000467000|-80.15625|-80.33203|
|   0003177d|1550000476500|1550000467500|-80.15625|-80.34128|
|   0003177d|1550000476500|1550000468000|-80.15625|-80.35156|
|   0003177d|1550000476500|1550000468500|-80.15625|-80.36305|
|   0003177d|1550000476500|1550000469000|-79.80469|-80.37598|
|   0003177d|1550000476500|1550000469500|-79.80469|-80.41406|
|   0003177d|1550000476500|1550000470000|-79.80469|-80.45759|
|   0003177d|1550000476500|1550000470500|-79.80469|-80.50781|
|   0003177d|1550000476500|1550000471000|-79.80469|-80.56641|
|   0003177d|1550000476500|1550000471500|-79.80469|-80.63565|
|   0003177d|1550000476500|1550000472000|-80.15625|-80.71875|
|   0003177d|1550000476500|1550000472500|-80.50781|-80.78125|
|   0003177d|1550000476500|1550000473000|-80.85938|-80.81543|
|   0003177d|1550000476500|1550000473500|-80.85938|-80.80915|
|   0003177d|1550000476500|1550000474000|-80.85938|-80.80078|
|   0003177d|1550000476500|1550000474500|-80.85938|-80.78906|
|   0003177d|1550000476500|1550000475000|-80.85938|-80.77148|
|   0003177d|1550000476500|1550000475500|-80.85938|-80.74219|
|   0003177d|1550000476500|1550000476000|-80.85938|-80.68359|
|   0003177d|1550000476500|1550000476500|-80.50781|-80.50781|
+-----------+-------------+-------------+---------+---------+
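
A quick arithmetic check of the first two expected means, computed directly from the 20 sample values with plain Python (not part of the original post):

# The 20 sample values in timestamp order.
vals = ([-80.15625] * 4 + [-79.8046875] * 6 + [-80.15625, -80.5078125]
        + [-80.859375] * 7 + [-80.5078125])

print(sum(vals) / len(vals))          # -80.33203125    (row 1: all 20 values)
print(sum(vals[1:]) / len(vals[1:]))  # -80.3412828...  (row 2: last 19 values)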
Pizzacat
  • Here is the reference link for calculating mean values for a *pyspark* data frame: https://stackoverflow.com/questions/47995188/how-to-calculate-mean-and-standard-deviation-given-a-pyspark-dataframe . You can also check this link for reference - https://stackoverflow.com/questions/44382822/pyspark-add-the-average-as-a-new-column-to-dataframe – Muhammad Usman Bashir Nov 11 '19 at 13:30
  • [rangebetween](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rangeBetween) takes two integer values to refer to rows relative to the current row. It can't handle columns (see the sketch after these comments). Can you please show us the expected output of your example? – cronoik Nov 11 '19 at 22:23
  • @cronoik, I edited the post with the expected result, and I also edited the Liftoff values to make the example easier. Thank you for the help. – Pizzacat Nov 12 '19 at 09:08
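
As cronoik's comment notes, rangeBetween expects numeric start/end offsets relative to the current row's orderBy value, not Column objects. A minimal sketch of the valid signature, using a hypothetical trailing window of 5000 ms (the window width is illustrative, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql import Window

# rangeBetween takes numeric offsets (or Window.* constants) measured
# against the current row's orderBy value. Here: a hypothetical trailing
# window covering the previous 5000 ms up to the current timestamp.
w = (Window.partitionBy('aircraft_id')
           .orderBy('timestamp')
           .rangeBetween(-5000, Window.currentRow))

df.withColumn('trailing_mean', F.mean('value').over(w)).show()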

1 Answer


I think you are almost there. You just need to set rangeBetween in the window spec to start at the current row (Window.currentRow) and run to the end of the partition (Window.unboundedFollowing), like below:

Note: ini_TO is not provided in the sample dataset, so it is removed from partitionBy for testing.

wind_spec = Window.partitionBy('aircraft_id','Liftoff').orderBy('timestamp').rangeBetween(Window.currentRow, Window.unboundedFollowing)

The above window spec gives the desired output:

df.withColumn('val', F.mean('value').over(wind_spec)).show()
+-----------+-------------+-------------+-----------+------------------+
|aircraft_id|      Liftoff|    timestamp|      value|               val|
+-----------+-------------+-------------+-----------+------------------+
|   0003177d|1550000476500|1550000467000|  -80.15625|      -80.33203125|
|   0003177d|1550000476500|1550000467500|  -80.15625|-80.34128289473684|
|   0003177d|1550000476500|1550000468000|  -80.15625|       -80.3515625|
|   0003177d|1550000476500|1550000468500|  -80.15625|-80.36305147058823|
|   0003177d|1550000476500|1550000469000|-79.8046875|    -80.3759765625|
|   0003177d|1550000476500|1550000469500|-79.8046875|       -80.4140625|
|   0003177d|1550000476500|1550000470000|-79.8046875|-80.45758928571429|
|   0003177d|1550000476500|1550000470500|-79.8046875|       -80.5078125|
|   0003177d|1550000476500|1550000471000|-79.8046875|      -80.56640625|
|   0003177d|1550000476500|1550000471500|-79.8046875| -80.6356534090909|
|   0003177d|1550000476500|1550000472000|  -80.15625|         -80.71875|
|   0003177d|1550000476500|1550000472500|-80.5078125|         -80.78125|
|   0003177d|1550000476500|1550000473000| -80.859375|    -80.8154296875|
|   0003177d|1550000476500|1550000473500| -80.859375|-80.80915178571429|
|   0003177d|1550000476500|1550000474000| -80.859375|      -80.80078125|
|   0003177d|1550000476500|1550000474500| -80.859375|       -80.7890625|
|   0003177d|1550000476500|1550000475000| -80.859375|     -80.771484375|
|   0003177d|1550000476500|1550000475500| -80.859375|       -80.7421875|
|   0003177d|1550000476500|1550000476000| -80.859375|      -80.68359375|
|   0003177d|1550000476500|1550000476500|-80.5078125|       -80.5078125|
+-----------+-------------+-------------+-----------+------------------+
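
In this sample every Liftoff coincides with the last timestamp of its partition, so Window.unboundedFollowing reaches exactly to Liftoff. If Liftoff could fall before the end of a partition, a hedged alternative (a sketch, not part of the original answer) is conditional aggregation over the same frame, relying on Liftoff being constant per aircraft:

# F.avg ignores the nulls produced by F.when, so only frame rows with
# timestamp <= Liftoff contribute to the mean. Assumes Liftoff is constant
# within each partition, as in the sample data.
df.withColumn(
    'val',
    F.avg(F.when(F.col('timestamp') <= F.col('Liftoff'), F.col('value')))
     .over(wind_spec)
).show()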
SMaZ