
I need to compute the days of delay for each id from a set of ids, following some rules:

  • if delay == 0, cum_sum is zero
  • if delay == 1, it cumsums the days while delay stays == 1
  • within the sequence, when a delay == 0 appears, the cumsum restarts at the next delay == 1 (a plain-Python sketch of this rule follows the list)
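
Put another way, this is the per-id logic I'm describing (a plain-Python toy reference for one id's rows already sorted by date, not the Spark code I need):

# toy reference for the reset rule, one id at a time, rows in date order
def cum_sum_with_reset(delays):
    out, running = [], 0
    for d in delays:
        if d == 0:
            running = 0        # a delay == 0 resets the streak
            out.append(0)
        else:
            running += 1       # keep counting while delay == 1
            out.append(running)
    return out

# cum_sum_with_reset([0, 1, 1, 0, 1, 1]) -> [0, 1, 2, 0, 1, 2]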

I'm trying this:

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import Window

# running total over all rows of the same id up to the current date
windowval = Window.partitionBy('id').orderBy('date') \
             .rangeBetween(Window.unboundedPreceding, 0)

# keep the running sum where delay == 1, force 0 elsewhere
df_w_cumsum = df \
    .withColumn('cum_sum', f.when(f.col('delay') == 1, f.sum('delay').over(windowval)).otherwise(0))

df_w_cumsum \
    .where((f.col('CCB') == '01') | (f.col('CCB') == '02')) \
    .select('date', 'id', 'ccb', 'delay', 'cum_sum').show(50)

But the result in the cum_sum column is not what I want: because the window spans every preceding row of the id, the sum keeps carrying the value accumulated in earlier delay == 1 streaks instead of restarting after a delay == 0.

+----------+---+-------------------+--------+-------+
|      date| id|               diff|   delay|cum_sum|
+----------+---+-------------------+--------+-------+
|2019-03-25| 01|                0.0|       0|      0|
|2019-03-26| 01|-0.8009825176018239|       0|      0|
|2019-03-27| 01| -8.043299057243985|       0|      0|
|2019-03-28| 01|  10.44232245458943|       1|      1|
|2019-03-29| 01| -6.431268561932029|       0|      0|
|2019-03-30| 01| 12.057037757827857|       1|      2|
|2019-03-31| 01|  30.57613602073252|       1|      3|
|2019-04-01| 01| 23.813922489788638|       1|      4|
|2019-04-02| 01|  -36.4545666689105|       0|      0|
|2019-04-03| 01|  -65.2062636145838|       0|      0|
|2019-03-25| 02|                0.0|       0|      0|
|2019-03-26| 02| 19.404830156549906|       1|      1|
|2019-03-27| 02|  38.82952192094308|       1|      2|
|2019-03-28| 02| -4.405904377682418|       0|      0|
|2019-03-29| 02| 14.994416162227935|       1|      3|
|2019-03-30| 02|  34.41459369421227|       1|      4|
|2019-03-31| 02|  53.85464854268412|       1|      5|
|2019-04-01| 02|  37.27460105285991|       1|      6|
|2019-04-02| 02| -6.717583231493336|       0|      0|
|2019-04-03| 02| -7.210466068066125|       0|      0|
|2019-04-04| 02|  29.99314720820803|       1|      7|
|2019-04-05| 02|  23.96867653208119|       1|      8|
|2019-04-06| 02| -43.39803957296772|       0|      0|
+----------+---+-------------------+--------+-------+

What I really need is the cum_sum below, where the sum restarts at the first delay == 1 after a delay == 0.

+----------+---+-------------------+--------+-------+
|      date| id|               diff|   delay|cum_sum|
+----------+---+-------------------+--------+-------+
|2019-03-25| 01|                0.0|       0|      0|
|2019-03-26| 01|-0.8009825176018239|       0|      0|
|2019-03-27| 01| -8.043299057243985|       0|      0|
|2019-03-28| 01|  10.44232245458943|       1|      1|
|2019-03-29| 01| -6.431268561932029|       0|      0|
|2019-03-30| 01| 12.057037757827857|       1|      1|
|2019-03-31| 01|  30.57613602073252|       1|      2|
|2019-04-01| 01|-23.813922489788638|       0|      0|
|2019-04-02| 01|  -36.4545666689105|       0|      0|
|2019-04-03| 01|  -65.2062636145838|       0|      0|
|2019-03-25| 02|                0.0|       0|      0|
|2019-03-26| 02| 19.404830156549906|       1|      1|
|2019-03-27| 02|  38.82952192094308|       1|      2|
|2019-03-28| 02| -4.405904377682418|       0|      0|
|2019-03-29| 02| 14.994416162227935|       1|      1|
|2019-03-30| 02|  34.41459369421227|       1|      2|
|2019-03-31| 02|  53.85464854268412|       1|      3|
|2019-04-01| 02|  37.27460105285991|       1|      4|
|2019-04-02| 02| 56.717583231493336|       1|      5|
|2019-04-03| 02| 27.210466068066125|       1|      6|
|2019-04-04| 02|  29.99314720820803|       1|      7|
|2019-04-05| 02| -23.96867653208119|       0|      0|
|2019-04-06| 02| -43.39803957296772|       0|      0|
+----------+---+-------------------+--------+-------+

Can anyone help me to achieve this?

Many thanks!

Bruno Pinheiro
  • Hey @samkart. Thanks! Any way to achieve this is very welcome. I'm not an expert in Spark. Can you give tips on how I could do this using RDD transformations? – Bruno Pinheiro Jan 22 '20 at 15:05
  • Does this answer your question? [Spark SQL window function with complex condition](https://stackoverflow.com/questions/42448564/spark-sql-window-function-with-complex-condition) – user10938362 Jan 22 '20 at 16:17
  • Not exactly @user10938362. My need was to iterate over an id by date and reset the operation based on a condition in another column. But I found the answer here: https://stackoverflow.com/questions/56384625/pyspark-cumulative-sum-with-reset-condition (a sketch of that pattern is below) – Bruno Pinheiro Jan 22 '20 at 20:38
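
For reference, here is a sketch of the pattern from that linked question, adapted to the column names in this question (an untested adaptation, not a verified answer): first derive a group id that increments at every delay == 0, then run the cumulative sum within each (id, grp) group so it restarts automatically.

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('id').orderBy('date')

df_w_cumsum = (
    df
    # each delay == 0 row opens a new group: the running count of zeros is the group id
    .withColumn('grp', f.sum((f.col('delay') == 0).cast('int'))
                        .over(w.rowsBetween(Window.unboundedPreceding, 0)))
    # summing delay inside each (id, grp) group restarts the count after every delay == 0
    .withColumn('cum_sum', f.sum('delay')
                            .over(Window.partitionBy('id', 'grp').orderBy('date')
                                        .rowsBetween(Window.unboundedPreceding, 0)))
    .drop('grp')
)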

0 Answers