
Let's say I have a list of numerics:

val list = List(4,12,3,6,9)

For every element in the list, I need to find the rolling sum, i.e. the final output should be:

List(4, 16, 19, 25, 34)

Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both? Something like `map(initial)((curr, prev) => curr + prev)`.

I want to achieve this without maintaining any shared global state.

EDIT: I would like to be able to do the same kinds of computation on RDDs.

Ankit Khettry

3 Answers


You may use `scanLeft`:

list.scanLeft(0)(_ + _).tail
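
For reference, a quick illustration with the list from the question (not part of the original answer) of why the trailing .tail is there:

val list = List(4, 12, 3, 6, 9)
list.scanLeft(0)(_ + _)        // List(0, 4, 16, 19, 25, 34) -- scanLeft keeps the initial 0
list.scanLeft(0)(_ + _).tail   // List(4, 16, 19, 25, 34)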
rstar
  • Thanks! This one worked. However, I want to achieve the same thing on spark RDDs and scanLeft doesn't seem to be implemented on RDDs. Can a similar operation be done on RDDs as well? – Ankit Khettry Jun 13 '17 at 09:53
  • Here is a description and an implementation of `scanLeft` for RDD: http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ – Dima Jun 13 '17 at 10:52

The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length

  // First pass: total up every partition except the last (its total is never
  // needed), then scanLeft those totals so that partitionCumSums(i) is the
  // sum of all elements in the partitions before partition i.
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(num.zero)(num.plus))
  ).collect
   .scanLeft(num.zero)(num.plus)

  // Second pass: scan each partition locally, seeding the scan with the
  // offset accumulated over all earlier partitions.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
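
For illustration, a minimal usage sketch (assuming a SparkContext named sc is already in scope; the sample data and partition count are just examples):

val nums = sc.parallelize(Seq(4, 12, 3, 6, 9), numSlices = 2)
cumSum(nums).collect()   // Array(4, 16, 19, 25, 34)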

It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid). It is the associativity that is key for the parallelization; without it you are generally stuck running through the entries of the RDD serially.
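
For what it's worth, here is one way that generalization might look. This is a sketch only: the name cumFold and the explicit zero/op parameters are illustrative, and the body simply repeats the cumSum logic with the Numeric calls replaced by the supplied operator.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Same two-pass structure as cumSum above, generalized to an arbitrary
// "zero" element and associative operator (i.e. a monoid).
def cumFold[T: ClassTag](rdd: RDD[T])(zero: T)(op: (T, T) => T): RDD[T] = {
  val nPartitions = rdd.partitions.length

  // offsets(i) is everything before partition i, combined with op
  val offsets = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(zero)(op))
  ).collect
   .scanLeft(zero)(op)

  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = op(offsets(index), iter.next)
      iter.scanLeft(start)(op)
    }
  )
}

// e.g. a running maximum over an RDD[Int] named nums (another monoid):
//   cumFold(nums)(Int.MinValue)(math.max)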

Jason Scott Lenderman

I don't know what functionalities are supported by Spark RDDs, so I am not sure whether this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete my answer):

list.zipWithIndex.map{x => list.take(x._2+1).sum}

This code works for me and sums up the elements as required: it takes the index of each list element and then sums the corresponding first n+1 elements of the list (note the +1, since zipWithIndex starts at 0).

When printing it, I get the following:

List(4, 16, 19, 25, 34)
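
To spell out what the map computes at each index (an illustrative trace, not part of the original answer):

list.zipWithIndex      // List((4,0), (12,1), (3,2), (6,3), (9,4))
list.take(0 + 1).sum   // 4
list.take(1 + 1).sum   // 16
list.take(2 + 1).sum   // 19
list.take(3 + 1).sum   // 25
list.take(4 + 1).sum   // 34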
Thomas Böhm
  • Are you sure this would work for a parallel collection? Will a `parList.take(n)` always return the first n elements? Or will it return n elements randomly? Also, this would be computationally expensive, since you are basically adding the first n elements by brute force for each iteration in the loop. Misapplication of functional programming in my view. – Ankit Khettry Jun 16 '17 at 06:58
  • This is an O(n) problem, but you are using an O(n^2) algorithm.