I have a table built with the crosstab function in PySpark, something like this:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")],
                             ["time", "value", "class"] )

tabla = df.crosstab("value","class")
tabla.withColumn("Total",tabla.a + tabla.b).show()


+-----------+---+---+-----+
|value_class|  a|  b|Total|
+-----------+---+---+-----+
|          2|  4|  0|    4|
|          4|  1|  2|    3|
|          3|  1|  4|    5|
+-----------+---+---+-----+

I need to add a new column that holds the cumulative sum of "Total".
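
For reference, a plain Python sketch (no Spark required) of the running total being asked for, using the Total values from the table above. Ordering the rows by ascending value_class is an assumption, since the question does not state an order, and the variable names are only illustrative.

```python
from itertools import accumulate

# (value_class, Total) pairs copied from the table above
rows = [(2, 4), (4, 3), (3, 5)]

# Assumption: the running total follows ascending value_class
ordered = sorted(rows)
running = list(accumulate(total for _, total in ordered))

# Print each row next to its cumulative sum
for (value_class, total), cum in zip(ordered, running):
    print(value_class, total, cum)
```

A cumulative sum only makes sense relative to some row order, so any Spark solution also needs an explicit ordering.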

Jason Aller
Juan David
  • Possible duplicate of [Python Spark Cumulative Sum by Group Using DataFrame](https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe) – snark Jan 22 '19 at 15:32

1 Answer

Hope this helps:

This is just an example; you can use partitionBy, orderBy, etc. to build the window that fits your data.

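from pyspark.sql import Window, functions as F

# Running total: order the rows, then sum from the first row up to the current one.
# Assumes tabla already has the Total column.
window = Window.orderBy("value_class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
tabla = tabla.withColumn("CumSumTotal", F.sum("Total").over(window))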
pissall
  • Hi, I tried the solution but it shows an error: Traceback (most recent call last): File "", line 1, in File "/opt/cloudera/parcels/CDH-5.8.5-1.cdh5.8.5.p0.5/lib/spark/python/pyspark/sql/column.py", line 243, in __iter__ raise TypeError("Column is not iterable") TypeError: Column is not iterable. What could be causing it? – Juan David Oct 30 '17 at 15:48
  • There's a small typo; should be `tabla = tabla.withColumn("CumSumTotal", sum(tabla.Total).over(window))` – creativename Jan 03 '20 at 15:54
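
The "Column is not iterable" error in the comment above most likely comes from Python's built-in `sum`, which tries to iterate over its argument; `from pyspark.sql.window import *` does not import Spark's `sum`. The Spark-free sketch below reproduces the mechanism with a hypothetical `FakeColumn` stand-in (not a PySpark class) whose `__iter__` raises the same TypeError that the traceback shows for `Column.__iter__`.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column; not part of PySpark."""

    def __iter__(self):
        # Mirrors the behaviour shown in the traceback
        raise TypeError("Column is not iterable")


def try_builtin_sum(col):
    """Return the error message raised by the built-in sum, or None."""
    try:
        sum(col)  # the built-in sum calls iter(col) first
    except TypeError as exc:
        return str(exc)
    return None


message = try_builtin_sum(FakeColumn())
print(message)
```

In real PySpark code, importing the aggregate explicitly, for example `from pyspark.sql import functions as F` and then calling `F.sum(...)`, avoids picking up the built-in by accident.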