
I want to reduce an RDD[Array[Double]] so that each element of an array is added to the corresponding element of the next array. This is the code I am using at the moment:

val rdd1: RDD[Array[Double]] = ...

val coord = rdd1.reduce((x, y) => (x, y).zipped.map(_ + _))

Is there a better way to do this more efficiently? It is quite costly at the moment.

KyBe
  • Can you quantify "quite costly"? This looks like O(mn), where m is the RDD length and n is the array length. The zip part is where you could probably improve, since you first have to construct an array of (x, y) pairs. – Chris Scott Jun 24 '15 at 12:29
  • How big are your arrays? If they are just 2 or 3 elements, it's a completely different situation to if they are large. – Rüdiger Klaehn Jun 24 '15 at 12:59
  • They are between 3 and 1000; it depends on the dataset. – KyBe Jun 24 '15 at 13:01

2 Answers


Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles.
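
For comparison, here is a minimal hand-rolled sketch of a boxing-free reduce with no extra library, assuming all arrays have the same length:

// Element-wise sum that stays in primitive doubles:
// no tuple allocation, no boxing, one preallocated output array.
def sumArrays(x: Array[Double], y: Array[Double]): Array[Double] = {
  val out = new Array[Double](x.length)
  var i = 0
  while (i < x.length) {
    out(i) = x(i) + y(i)
    i += 1
  }
  out
}

val coord = rdd1.reduce(sumArrays)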

If you use spire, you can just do this:

> import spire.implicits._
> val rdd1 = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
> val coord = rdd1.reduce(_ + _)
coord: Array[Double] = Array(4.0, 6.0)

This is much nicer to look at, and should also be much more efficient.

Spire is a dependency of spark, so you should be able to do the above without any extra dependencies. At least it worked in a spark-shell for spark 1.3.1 here.
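
If spire does not happen to be on your classpath, pulling it in explicitly is a single sbt line (the version below is an assumption, roughly contemporary with Spark 1.3.1):

libraryDependencies += "org.spire-math" %% "spire" % "0.9.1"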

This will work for any array where there is an AdditiveSemigroup typeclass instance available for the element type. In this case, the element type is Double. Spire typeclasses are @specialized for Double, so there will be no boxing going on anywhere.
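
For reference, the core of that typeclass is tiny; a simplified sketch of spire.algebra.AdditiveSemigroup (the real trait is @specialized and carries a few extra methods):

// A type A with an associative addition operation.
trait AdditiveSemigroup[A] {
  def plus(x: A, y: A): A
}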

If you really want to know what is going on to make this work, you have to use reify:

> import scala.reflect.runtime.{universe => u}
> val a = Array(1.0, 2.0)
> val b = Array(3.0, 4.0)
> u.reify { a + b }

res5: reflect.runtime.universe.Expr[Array[Double]] = Expr[scala.Array[Double]](
  implicits.additiveSemigroupOps(a)(
    implicits.ArrayNormedVectorSpace(
      implicits.DoubleAlgebra, 
      implicits.DoubleAlgebra,
      Predef.this.implicitly)).$plus(b))

So the addition works because there is an instance of AdditiveSemigroup for Array[Double].
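
You can also ask the compiler for that instance directly; a small sketch, assuming spire.implicits._ provides the instance the way the reify output suggests:

> import spire.algebra.AdditiveSemigroup
> import spire.implicits._
> val sg = implicitly[AdditiveSemigroup[Array[Double]]]
> sg.plus(Array(1.0, 2.0), Array(3.0, 4.0))  // Array(4.0, 6.0)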

Rüdiger Klaehn
  • As someone who doesn't know spire, can you explain how it reduces both down the RDD and across each array? – Chris Scott Jun 24 '15 at 12:32
  • It looks better, but unfortunately it's a little bit slower in execution. – KyBe Jun 24 '15 at 12:38
  • It uses implicits to provide suitably specialized implementations of operations. It might have one for `Array[Double]`, or a generic one for `Array` that will delegate to that for `Double`. Easiest way to see exactly what's going on is to put that code in an IDE and look through the tree of implicits that are used. – lmm Jun 24 '15 at 12:39
  • I assume spire is parallelizing on a single host via threads? or is it tied to spark to push the work out as transformations on an RDD? – Angelo Genovese Jun 24 '15 at 12:44
  • spire does not parallelize anything. It is just using specialization to avoid boxing. I thought the question was how to get the reduce operation as quick and concise as possible. – Rüdiger Klaehn Jun 24 '15 at 12:55
  • @KyBe can you quantify what you mean by slow? How many arrays of what size, and what is the runtime? (I assume all arrays are of the same size) – Rüdiger Klaehn Jun 24 '15 at 13:09
  • Yes, the arrays are all the same size. It took 10s more than the previous version for a dataset of size 500 with 32 columns. – KyBe Jun 24 '15 at 13:12
  • So summing 500 arrays of 32 elements each? That should be too fast to measure in any case. Takes less than 1s on my machine. – Rüdiger Klaehn Jun 24 '15 at 13:26

I assume the concern is that you have very large Array[Double] values and that the transformation as written does not distribute their addition. If so, you could do something like this (untested):

// map each Array[Double] to (index, value) pairs
val rdd2 = rdd1.flatMap(a => a.zipWithIndex.map(t => (t._2, t._1)))
// get the sum for each index
val reduced = rdd2.reduceByKey(_ + _)
// key everything the same to get a single iterable in groupByKey
val groupAll = reduced.map(t => ("constKey", (t._1, t._2)))
// get the doubles back together into an array, ordered by index,
// and bring the single resulting array back to the driver
val coord = groupAll.groupByKey()
                    .values
                    .map(vs => vs.toList.sortBy(_._1).map(_._2).toArray)
                    .first()
Angelo Genovese
  • if, instead of keying everything the same, you converted them to individual arrays of Array[Option[Double]] you might be able to collect them together just using reduce. If I have time a bit later I'll try to add that to the answer. – Angelo Genovese Jun 24 '15 at 12:43