Pyspark cumsum over same values in orderBy column

Question

I have the following dataframe:

+----+----+-----+
|col1|col2|value|
+----+----+-----+
|  11|   a|    1|
|  11|   a|    2|
|  11|   b|    3|
|  11|   a|    4|
|  11|   b|    5|
|  22|   a|    6|
|  22|   b|    7|
+----+----+-----+

I want to calculate to calculate the cumsum of the 'value' column that is partitioned by 'col1' and ordered by 'col2'.

This is the desired output:

+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
|  11|   a|    1|     1|
|  11|   a|    2|     3|
|  11|   a|    4|     7|
|  11|   b|    3|    10|
|  11|   b|    5|    15|
|  22|   a|    6|     6|
|  22|   b|    7|    13|
+----+----+-----+------+

I have used this code which gives me the df shown below. It is not what I wanted. Can someone help me please?

df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
|  11|   a|    2|     7|
|  11|   a|    1|     7|
|  11|   a|    4|     7|
|  11|   b|    3|    15|
|  11|   b|    5|    15|
|  22|   a|    6|     6|
|  22|   b|    7|    13|
+----+----+-----+------+

score 0 · Accepted Answer · answered Mar 18 '20 at 20:06

You have to use .rowsBetween instead of .rangeBetween in your window clause.

rowsBetween (vs) rangeBetween

Example:

df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()

#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#|  11|   a|    1|     1|
#|  11|   a|    2|     3|
#|  11|   a|    4|     7|
#|  11|   b|    3|    10|
#|  11|   b|    5|    15|
#|  12|   a|    6|     6|
#|  12|   b|    7|    13|
#+----+----+-----+------+

Pyspark cumsum over same values in orderBy column

1 Answers1