I have a dataframe with group
, value
, date_start
, and date_end
. I want to take the cumulative sum of all values:
- partition by group
- that end before the current
date_start
Here is how the data looks like:
+-----+-----+----------+----------+
|group |value|date_start|date_end |
+-----+-----+----------+----------+
|a |1 |2016-05-04|2016-05-05|
|a |2 |2016-05-06|2016-05-06|
|a |5 |2016-07-06|2016-10-06|
|a |2 |2016-09-10|2016-09-20|
|a |3 |2016-11-12|2016-12-20|
|b |8 |2016-09-03|2016-11-06|
|b |2 |2016-11-04|2016-12-05|
|b |4 |2016-12-04|2016-12-06|
+-----+-----+----------+----------+
This is what I expected:
+-----+-----+----------+----------+-------+
|group|value|date_start|date_end |cum_sum|
+-----+-----+----------+----------+-------+
|a |1 |2016-05-04|2016-05-05| 0|
|a |2 |2016-05-06|2016-05-06| 1|
|a |5 |2016-07-06|2016-10-06| 3| => 1+2
|a |2 |2016-09-10|2016-09-20| 3| => 1+2; not include the 3rd row
|a |3 |2016-11-12|2016-12-20| 10| => 1+2+5+2
|b |8 |2016-09-03|2016-11-06| 0| => no sample satisfies the time condition
|b |2 |2016-11-04|2016-12-05| 0| => no sample satisfies the time condition
|b |4 |2016-12-04|2016-12-06| 8|
+-----+-----+----------+----------+-------+
I'm currently setting the window functions as this:
from pyspark.sql import Window
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
windowval = (Window.partitionBy('group').orderBy(F.col("date_start"))
.rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('cum_sum', F.sum('value').over(windowval) - F.col('value'))
Here's the result I got. I know it's not correct, but it just simply takes the cum_sum
along the partition.
+-----+-----+----------+----------+-------+
|group|value|date_start|date_end |cum_sum|
+-----+-----+----------+----------+-------+
|a |1 |2016-05-04|2016-05-05| 0|
|a |2 |2016-05-06|2016-05-06| 1|
|a |5 |2016-07-06|2016-10-06| 3|
|a |2 |2016-09-10|2016-09-20| 8|
|a |3 |2016-11-12|2016-12-20| 10|
|b |8 |2016-09-03|2016-11-06| 0|
|b |2 |2016-11-04|2016-12-05| 8|
|b |4 |2016-12-04|2016-12-06| 12|
+-----+-----+----------+----------+-------+
Is there a way to apply the condition on the date_start
and date_end
?