I need to compute the days of delay for each id from a set of ids, following some rules:
- if delay == 0, cum_sum will be zero
- if delay == 1, it it will cumsum days while delay == 1
- on the sequence, when it find a delay == 0, it will restart the cumsum on the next delay == 1
I'm trying this:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import Window
windowval = Window.partitionBy('id').orderBy('date') \
.rangeBetween(Window.unboundedPreceding, 0)
df_w_cumsum = df \
.withColumn('cum_sum', f.when(f.col('delay') == 1, f.sum('delay').over(windowval)).otherwise(0))
df_w_cumsum \
.where(f.col('CCB') == '01' | f.col('CCB') == '02') \
.select('date', 'id', 'ccb', 'delay', 'cum_sum').show(50)
But the returning result is not what I want on cum_sum
column. It keep the sum from the last sum value.
+----------+---+-------------------+--------+-------+
| date| id| diff| delay|cum_sum|
+----------+---+-------------------+--------+-------+
|2019-03-25| 01| 0.0| 0| 0|
|2019-03-26| 01|-0.8009825176018239| 0| 0|
|2019-03-27| 01| -8.043299057243985| 0| 0|
|2019-03-28| 01| 10.44232245458943| 1| 1|
|2019-03-29| 01| -6.431268561932029| 0| 0|
|2019-03-30| 01| 12.057037757827857| 1| 2|
|2019-03-31| 01| 30.57613602073252| 1| 3|
|2019-04-01| 01| 23.813922489788638| 1| 4|
|2019-04-02| 01| -36.4545666689105| 0| 0|
|2019-04-03| 01| -65.2062636145838| 0| 0|
|2019-03-25| 02| 0.0| 0| 0|
|2019-03-26| 02| 19.404830156549906| 1| 1|
|2019-03-27| 02| 38.82952192094308| 1| 2|
|2019-03-28| 02| -4.405904377682418| 0| 0|
|2019-03-29| 02| 14.994416162227935| 1| 3|
|2019-03-30| 02| 34.41459369421227| 1| 4|
|2019-03-31| 02| 53.85464854268412| 1| 5|
|2019-04-01| 02| 37.27460105285991| 1| 6|
|2019-04-02| 02| -6.717583231493336| 0| 0|
|2019-04-03| 02| -7.210466068066125| 0| 0|
|2019-04-04| 02| 29.99314720820803| 1| 7|
|2019-04-05| 02| 23.96867653208119| 1| 8|
|2019-04-06| 02| -43.39803957296772| 0| 0|
+----------+---+-------------------+--------+-------+
What I really need is this cum_sum
result, restarting the sums when delay == 1
after delay == 0.
+----------+---+-------------------+--------+-------+
| date| id| diff| delay|cum_sum|
+----------+---+-------------------+--------+-------+
|2019-03-25| 01| 0.0| 0| 0|
|2019-03-26| 01|-0.8009825176018239| 0| 0|
|2019-03-27| 01| -8.043299057243985| 0| 0|
|2019-03-28| 01| 10.44232245458943| 1| 1|
|2019-03-29| 01| -6.431268561932029| 0| 0|
|2019-03-30| 01| 12.057037757827857| 1| 1|
|2019-03-31| 01| 30.57613602073252| 1| 2|
|2019-04-01| 01|-23.813922489788638| 0| 0|
|2019-04-02| 01| -36.4545666689105| 0| 0|
|2019-04-03| 01| -65.2062636145838| 0| 0|
|2019-03-25| 02| 0.0| 0| 0|
|2019-03-26| 02| 19.404830156549906| 1| 1|
|2019-03-27| 02| 38.82952192094308| 1| 2|
|2019-03-28| 02| -4.405904377682418| 0| 0|
|2019-03-29| 02| 14.994416162227935| 1| 1|
|2019-03-30| 02| 34.41459369421227| 1| 2|
|2019-03-31| 02| 53.85464854268412| 1| 3|
|2019-04-01| 02| 37.27460105285991| 1| 4|
|2019-04-02| 02| 56.717583231493336| 1| 5|
|2019-04-03| 02| 27.210466068066125| 1| 6|
|2019-04-04| 02| 29.99314720820803| 1| 7|
|2019-04-05| 02| -23.96867653208119| 0| 0|
|2019-04-06| 02| -43.39803957296772| 0| 0|
+----------+---+-------------------+--------+-------+
Can anyone help me to achieve this?
Many thanks!