
I have an RDD of mutable.Map[Int, Array[Double]] and I would like to reduce the maps by the Int key and find the elementwise means of the arrays.

For example I have:

Map(1 -> Array(0.1, 0.1), 2 -> Array(0.3, 0.2))
Map(1 -> Array(0.1, 0.4))

What I want:

Map(1 -> Array(0.1, 0.25), 2 -> Array(0.3, 0.2))

The problem is that I don't know how reduce works between maps, and I also thought I would have to do it per partition, collect the results to the driver, and reduce them there too. I found the foreachPartition method, but I don't know whether it is meant to be used in such cases.

Any ideas?

1 Answer


You can do it with combineByKey, which takes three functions: one that creates a combiner from the first value seen for a key, one that merges a further value into a partition-local combiner, and one that merges combiners from different partitions:

// ss is an existing SparkSession
val rdd = ss.sparkContext.parallelize(Seq(
  Map((1, Array(0.1, 0.1)), (2, Array(0.3, 0.2))),
  Map((1, Array(0.1, 0.4)))
))

// functions for combineByKey
val create = (arr: Array[Double]) => arr.map(x => (x, 1)) // pair each element with a count of 1
val update = (acc: Array[(Double, Int)], current: Array[Double]) =>
  acc.zip(current).map { case ((s, c), x) => (s + x, c + 1) } // fold a value in within a partition
val merge = (acc1: Array[(Double, Int)], acc2: Array[(Double, Int)]) =>
  acc1.zip(acc2).map { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // merge partition results

val finalMap = rdd.flatMap(_.toList)
  // aggregate elementwise sums & counts
  .combineByKey(create, update, merge)
  // calculate the elementwise average per key
  .map { case (id, arr) => (id, arr.map { case (s, c) => s / c }) }
  .collectAsMap()

// finalMap = Map(2 -> Array(0.3, 0.2), 1 -> Array(0.1, 0.25))
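
Note that combineByKey already does the per-partition work you describe: update runs within each partition and merge combines the per-partition results, so there is no need for foreachPartition. As a shorter alternative sketch, assuming all arrays under a given key have the same length (the zips above assume this as well; finalMap2 is just an illustrative name), you can keep a single count per key with reduceByKey:

val finalMap2 = rdd.flatMap(_.toList)
  // attach a count of 1 to each array
  .mapValues(arr => (arr, 1))
  // elementwise sum of the arrays, total count per key
  .reduceByKey { case ((a1, c1), (a2, c2)) =>
    (a1.zip(a2).map { case (x, y) => x + y }, c1 + c2)
  }
  // divide each elementwise sum by the number of arrays
  .mapValues { case (sum, count) => sum.map(_ / count) }
  .collectAsMap()

This yields the same result as the combineByKey version for equal-length arrays.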