0

I have a dataframe with group, value, date_start, and date_end. I want to take the cumulative sum of all values:

  • partition by group
  • that end before the current date_start

Here is how the data looks like:

+-----+-----+----------+----------+
|group |value|date_start|date_end |
+-----+-----+----------+----------+
|a    |1    |2016-05-04|2016-05-05|
|a    |2    |2016-05-06|2016-05-06|
|a    |5    |2016-07-06|2016-10-06|
|a    |2    |2016-09-10|2016-09-20|
|a    |3    |2016-11-12|2016-12-20|
|b    |8    |2016-09-03|2016-11-06|
|b    |2    |2016-11-04|2016-12-05|
|b    |4    |2016-12-04|2016-12-06|
+-----+-----+----------+----------+

This is what I expected:

+-----+-----+----------+----------+-------+
|group|value|date_start|date_end  |cum_sum|
+-----+-----+----------+----------+-------+
|a    |1    |2016-05-04|2016-05-05|      0|
|a    |2    |2016-05-06|2016-05-06|      1|
|a    |5    |2016-07-06|2016-10-06|      3| => 1+2
|a    |2    |2016-09-10|2016-09-20|      3| => 1+2; not include the 3rd row
|a    |3    |2016-11-12|2016-12-20|     10| => 1+2+5+2
|b    |8    |2016-09-03|2016-11-06|      0| => no sample satisfies the time condition
|b    |2    |2016-11-04|2016-12-05|      0| => no sample satisfies the time condition
|b    |4    |2016-12-04|2016-12-06|      8|
+-----+-----+----------+----------+-------+

I'm currently setting the window functions as this:

from pyspark.sql import Window
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

windowval = (Window.partitionBy('group').orderBy(F.col("date_start"))
             .rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('cum_sum', F.sum('value').over(windowval) - F.col('value'))

Here's the result I got. I know it's not correct, but it just simply takes the cum_sum along the partition.

+-----+-----+----------+----------+-------+
|group|value|date_start|date_end  |cum_sum|
+-----+-----+----------+----------+-------+
|a    |1    |2016-05-04|2016-05-05|      0|
|a    |2    |2016-05-06|2016-05-06|      1|
|a    |5    |2016-07-06|2016-10-06|      3|
|a    |2    |2016-09-10|2016-09-20|      8|
|a    |3    |2016-11-12|2016-12-20|     10|
|b    |8    |2016-09-03|2016-11-06|      0|
|b    |2    |2016-11-04|2016-12-05|      8|
|b    |4    |2016-12-04|2016-12-06|     12|
+-----+-----+----------+----------+-------+

Is there a way to apply the condition on the date_start and date_end?

Mario
  • 1,631
  • 2
  • 21
  • 51
Hau Phan
  • 1
  • 1

0 Answers0