
How do I compute a cumulative sum per group in PySpark, specifically using the DataFrame abstraction?

With an example dataset as follows:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], 
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+

I would like to add a cumulative sum column of value for each class grouping over the (ordered) time variable.
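
For illustration, the desired result (a cum_sum of value per class, accumulated over time) would look something like:

+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
|   1|    3|    b|      3|
|   2|    3|    b|      6|
+----+-----+-----+-------+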

– Prags

4 Answers


This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    3|    b|      3|
|   2|    3|    b|      6|
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
+----+-----+-----+-------+
– galath
    This answer is not correct. "rowsBetween" is the correct one to use instead of "rangeBetween". The cum sum will be incorrect if the "orderBy" column has duplicated values in each partition. – Tim Jan 28 '22 at 21:58
  • @Tim thanks, that saved my life! Was confused why it didn't work anymore (it worked before), changing it to rowsBetween solved the issues. – Annet Jun 14 '23 at 09:47
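
To illustrate the comments above, here is a minimal sketch (on a hypothetical dataset where class "a" has two rows with time = 1) showing the difference: rangeBetween treats rows with the same time value as peers and gives them the same cumulative sum, while rowsBetween produces a row-by-row running total.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical data: two rows share time = 1 within class "a"
dup_df = spark.createDataFrame([(1, 2, "a"), (1, 4, "a"), (2, 3, "a")],
                               ["time", "value", "class"])

range_w = (Window.partitionBy("class").orderBy("time")
           .rangeBetween(Window.unboundedPreceding, 0))
rows_w = (Window.partitionBy("class").orderBy("time")
          .rowsBetween(Window.unboundedPreceding, 0))

# with rangeBetween both time=1 rows get cum_sum 6 (2+4); with rowsBetween the
# running sum advances one row at a time (e.g. 2, 6, 9, depending on tie order)
dup_df.withColumn("range_cumsum", F.sum("value").over(range_w)) \
      .withColumn("rows_cumsum", F.sum("value").over(rows_w)) \
      .show()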

To update the previous answers: the correct and precise way to do this is:

from pyspark.sql import Window
from pyspark.sql import functions as F

windowval = (Window.partitionBy('class').orderBy('time')
             .rowsBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
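On the question's example data this produces the same output as the answer above, since no time value repeats within a class; the two frame specifications only differ when the orderBy column contains duplicates, as the comments on the accepted answer point out.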
– vegetarianCoder

I tried it this way and it worked for me:

from pyspark.sql import Window
from pyspark.sql import functions as f
import sys

# -sys.maxsize as the frame start acts as an effectively unbounded preceding range
cum_sum = df.withColumn('cumsum', f.sum('value').over(
    Window.partitionBy('class').orderBy('time').rowsBetween(-sys.maxsize, 0)))
cum_sum.show()
– Anubhav Raj

I created this function for my own use; it lives at kolang/column_functions/cumulative_sum:

from typing import List, Union

from pyspark.sql import Column, Window
from pyspark.sql import functions as F


def cumulative_sum(col: Union[Column, str],
                   on_col: Union[Column, str],
                   ascending: bool = True,
                   partition_by: Union[Column, str, List[Union[Column, str]]] = None) -> Column:
    # sort descending when ascending=False
    on_col = on_col if ascending else F.desc(on_col)
    if partition_by is None:
        w = Window.orderBy(on_col).rangeBetween(Window.unboundedPreceding, 0)
    else:
        w = Window.partitionBy(partition_by).orderBy(on_col).rangeBetween(Window.unboundedPreceding, 0)
    return F.sum(col).over(w)
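
A usage sketch on the question's DataFrame (assuming the imports above and the df defined in the question):

df.withColumn("cum_sum", cumulative_sum("value", on_col="time", partition_by="class")).show()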