How to compute cumulative sum on multiple float columns?

Question

I have 100 float columns in a DataFrame which is ordered by date.

ID   Date         C1       C2 ....... C100
1     02/06/2019   32.09  45.06         99
1     02/04/2019   32.09  45.06         99
2     02/03/2019   32.09  45.06         99
2     05/07/2019   32.09  45.06         99

I need the cumulative sum of each column from C1 to C100, computed per ID and in date order.

Target dataframe should look like this:

ID   Date         C1       C2 ....... C100
1     02/04/2019   32.09  45.06         99
1     02/06/2019   64.18  90.12         198
2     02/03/2019   32.09  45.06         99
2     05/07/2019   64.18  90.12         198

I want to achieve this without looping from C1 to C100.

Initial code for one column:

val DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))

I found a similar question (Cumulative sum in Spark), but its answer writes the expression out manually for just two columns.

Did you get an answer to this question? – Leothorn, Jan 31 '20 at 13:29

score 5 · Answer 1 · edited Jan 31 '20 at 14:25

It's a classic use case for foldLeft. Let's generate some data first:

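import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the 'id symbol syntax; spark-shell imports this automatically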

val df = spark.range(1000)
            .withColumn("c1", 'id + 3)
            .withColumn("c2", 'id % 2 + 1)
            .withColumn("date", monotonically_increasing_id)
            .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.       
val columns = df.drop("id", "date").columns

val w = Window.partitionBy(col("id")).orderBy(col("date").asc) 

val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))

results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2|       date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// |  1|  3|  1|          0|         3|         1|
// |  1| 13|  1|         10|        16|         2|
// |  1| 23|  1|         20|        39|         3|
// |  1| 33|  1|         30|        72|         4|
// |  1| 43|  1|         40|       115|         5|
// |  1| 53|  1| 8589934592|       168|         6|
// |  1| 63|  1| 8589934602|       231|         7|
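
For readers less familiar with foldLeft, here is a minimal Spark-free sketch of the same pattern. It is only an illustration, not the answer's code: the `Table` type and the local `withColumn` and `cumSum` helpers are hypothetical stand-ins for a DataFrame, `DataFrame.withColumn`, and the windowed `sum`. The fold starts from the original table and adds one `cum_sum_` column per name, passing the growing table along at each step.

```scala
object FoldLeftSketch {
  // Hypothetical stand-in for a DataFrame: column name -> values, already in date order.
  type Table = Map[String, Vector[Double]]

  // Stand-in for DataFrame.withColumn: returns a new table with one extra column.
  def withColumn(t: Table, name: String, values: Vector[Double]): Table =
    t + (name -> values)

  // Running total of one column; scanLeft emits the seed first, so drop it with tail.
  def cumSum(values: Vector[Double]): Vector[Double] =
    values.scanLeft(0.0)(_ + _).tail

  // Same shape as the answer's fold: start from the table, add one column per name.
  def addCumSums(t: Table, columns: Seq[String]): Table =
    columns.foldLeft(t) { (acc, c) => withColumn(acc, s"cum_sum_$c", cumSum(acc(c))) }

  def main(args: Array[String]): Unit = {
    val table: Table = Map("c1" -> Vector(3.0, 13.0, 23.0), "c2" -> Vector(1.0, 1.0, 1.0))
    val result = addCumSums(table, Seq("c1", "c2"))
    println(result("cum_sum_c1")) // Vector(3.0, 16.0, 39.0)
    println(result("cum_sum_c2")) // Vector(1.0, 2.0, 3.0)
  }
}
```

The sample values match the first three rows for id 1 in the output above (3, 16, 39 and 1, 2, 3).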

score 1 · Accepted Answer · answered Jan 31 '20 at 14:38

Here is another way, using a simple select expression:

val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow) 

// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns

// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq

df.select(selectExpr:_*).show()

Gives:

+---+----------+-----+-----+----+
| ID|      Date|   C1|   C2|C100|
+---+----------+-----+-----+----+
|  1|02/04/2019|32.09|45.06|  99|
|  1|02/06/2019|64.18|90.12| 198|
|  2|02/03/2019|32.09|45.06|  99|
|  2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
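
One caveat worth checking: if Date is stored as a MM/dd/yyyy string, as the output above suggests, ordering the window by the raw string can misorder rows across years ("12/31/2018" sorts after "01/01/2019" as text). The sketch below is a variant of the select approach that orders by a parsed date instead. It assumes the MM/dd/yyyy format and a local SparkSession; the object and method names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, to_date}

object CumSumByParsedDate {
  // Cumulative sum of every column except ID and Date, per ID, in true date order.
  // Assumption: Date is a string in MM/dd/yyyy format.
  def cumSums(df: DataFrame): DataFrame = {
    val w = Window
      .partitionBy(col("ID"))
      .orderBy(to_date(col("Date"), "MM/dd/yyyy").asc)
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val columnsToSum = df.columns.filterNot(Set("ID", "Date"))
    val selectExpr =
      Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq

    df.select(selectExpr: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cumsum-sketch").getOrCreate()
    import spark.implicits._

    // Sample rows from the question, limited to two value columns.
    val df = Seq(
      (1, "02/06/2019", 32.09, 45.06),
      (1, "02/04/2019", 32.09, 45.06),
      (2, "02/03/2019", 32.09, 45.06),
      (2, "05/07/2019", 32.09, 45.06)
    ).toDF("ID", "Date", "C1", "C2")

    cumSums(df).orderBy(col("ID"), to_date(col("Date"), "MM/dd/yyyy")).show()
    spark.stop()
  }
}
```

If the column is already a DateType or TimestampType, the `to_date` call is unnecessary and the answer's window can be used as written.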
