How to calculate a cumulative sum in a PySpark table

Question

I have a table built with the crosstab function in PySpark, something like this:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")],
                             ["time", "value", "class"] )

tabla = df.crosstab("value","class")
tabla.withColumn("Total",tabla.a + tabla.b).show()


+-----------+---+---+-----+
|value_class|  a|  b|Total|
+-----------+---+---+-----+
|          2|  4|  0|    4|
|          4|  1|  2|    3|
|          3|  1|  4|    5|
+-----------+---+---+-----+

I need to add a new column that holds the cumulative sum of "Total".

Possible duplicate of [Python Spark Cumulative Sum by Group Using DataFrame](https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe) — snark, Jan 22 '19 at 15:32

pissall · Answer 1 · 2020-01-04T03:40:58.963


Hope this helps:

This is just an example; you can use partitionBy, orderBy, etc. to build the window.

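from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("value_class")
# Use Spark's F.sum; Python's built-in sum raises "Column is not iterable"
tabla = tabla.withColumn("CumSumTotal", F.sum(tabla.Total).over(window))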

edited Jan 04 '20 at 03:40

answered Oct 28 '17 at 06:18


Hi, I tried the solution but it shows me an error: Traceback (most recent call last): File "", line 1, in File "/opt/cloudera/parcels/CDH-5.8.5-1.cdh5.8.5.p0.5/lib/spark/python/pyspark/sql/column.py", line 243, in __iter__ raise TypeError("Column is not iterable") TypeError: Column is not iterable. What could be causing it? – Juan David Oct 30 '17 at 15:48
There's a small typo; should be `tabla = tabla.withColumn("CumSumTotal", sum(tabla.Total).over(window))` – creativename Jan 03 '20 at 15:54
