
Let's say I have a list of numerics:

val list = List(4,12,3,6,9)

For every element in the list, I need to find the rolling sum, i.e. the final output should be:

List(4, 16, 19, 25, 34)

Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both? Something like `map(initial)((curr, prev) => curr + prev)`.

I want to achieve this without maintaining any shared global state.

EDIT: I would like to be able to do the same kinds of computation on RDDs.

Ankit Khettry

3 Answers


You may use `scanLeft`:

list.scanLeft(0)(_ + _).tail
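
For reference, a quick illustration with the list from the question (not part of the original answer) of why the trailing .tail is there:

val list = List(4, 12, 3, 6, 9)
list.scanLeft(0)(_ + _)        // List(0, 4, 16, 19, 25, 34) -- scanLeft keeps the initial 0
list.scanLeft(0)(_ + _).tail   // List(4, 16, 19, 25, 34)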
rstar
  • Thanks! This one worked. However, I want to achieve the same thing on spark RDDs and scanLeft doesn't seem to be implemented on RDDs. Can a similar operation be done on RDDs as well? – Ankit Khettry Jun 13 '17 at 09:53
  • Here is a description and an implementation of `scanLeft` for RDD: http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ – Dima Jun 13 '17 at 10:52

The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length

  // First pass: total up every partition except the last (its total is never
  // needed), then scanLeft those totals so that partitionCumSums(i) is the
  // sum of all elements in the partitions before partition i.
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(num.zero)(num.plus))
  ).collect
   .scanLeft(num.zero)(num.plus)

  // Second pass: scan each partition locally, seeding the scan with the
  // offset accumulated over all earlier partitions.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
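
For illustration, a minimal usage sketch (assuming a SparkContext named sc is already in scope; the sample data and partition count are just examples):

val nums = sc.parallelize(Seq(4, 12, 3, 6, 9), numSlices = 2)
cumSum(nums).collect()   // Array(4, 16, 19, 25, 34)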

It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid). It is the associativity that is key for the parallelization; without it you are generally stuck running through the entries of the RDD serially.
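
For what it's worth, here is one way that generalization might look. This is a sketch only: the name cumFold and the explicit zero/op parameters are illustrative, and the body simply repeats the cumSum logic with the Numeric calls replaced by the supplied operator.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Same two-pass structure as cumSum above, generalized to an arbitrary
// "zero" element and associative operator (i.e. a monoid).
def cumFold[T: ClassTag](rdd: RDD[T])(zero: T)(op: (T, T) => T): RDD[T] = {
  val nPartitions = rdd.partitions.length

  // offsets(i) is everything before partition i, combined with op
  val offsets = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(zero)(op))
  ).collect
   .scanLeft(zero)(op)

  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = op(offsets(index), iter.next)
      iter.scanLeft(start)(op)
    }
  )
}

// e.g. a running maximum over an RDD[Int] named nums (another monoid):
//   cumFold(nums)(Int.MinValue)(math.max)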

Jason Scott Lenderman

I don't know what functionalities are supported by Spark RDDs, so I am not sure whether this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete my answer):

list.zipWithIndex.map{x => list.take(x._2+1).sum}

This code works for me and sums up the elements as required: it takes the index of each list element and then sums the corresponding first n+1 elements of the list (note the +1, since zipWithIndex starts at 0).

When printing it, I get the following:

List(4, 16, 19, 25, 34)
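
To spell out what the map computes at each index (an illustrative trace, not part of the original answer):

list.zipWithIndex      // List((4,0), (12,1), (3,2), (6,3), (9,4))
list.take(0 + 1).sum   // 4
list.take(1 + 1).sum   // 16
list.take(2 + 1).sum   // 19
list.take(3 + 1).sum   // 25
list.take(4 + 1).sum   // 34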
Thomas Böhm
  • Are you sure this would work for a parallel collection? Will a `parList.take(n)` always return the first n elements? Or will it return n elements randomly? Also, this would be computationally expensive, since you are basically adding the first n elements by brute force for each iteration in the loop. Misapplication of functional programming in my view. – Ankit Khettry Jun 16 '17 at 06:58
  • This is an O(n) problem, but you are using an O(n^2) algorithm.