
I have an RDD of a one-dimensional matrix (an array of Int). I am trying to do a very basic reduce operation to sum up the values at the same position of the matrix across the various partitions.

I am using:

var z = x.reduce((a, b) => a + b)

or

var z = x.reduce(_ + _)

But I am getting an error saying: type mismatch; found: Array[Int], expected: String

I looked it up and found this link: Is there a better way for reduce operation on RDD[Array[Double]]

So I tried using import spire.implicits._. Now I don't have any compilation error, but when I run the code I get a java.lang.NoSuchMethodError. I have provided the entire error below. Any help would be appreciated.

java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at spire.math.NumberTag$Integral$.<init>(NumberTag.scala:9)
at spire.math.NumberTag$Integral$.<clinit>(NumberTag.scala)
at spire.std.BigIntInstances.$init$(bigInt.scala:80)
at spire.implicits$.<init>(implicits.scala:6)
at spire.implicits$.<clinit>(implicits.scala)
at main.scala.com.ucr.edu.SparkScala.HistogramRDD$$anonfun$9.apply(HistogramRDD.scala:118)
at main.scala.com.ucr.edu.SparkScala.HistogramRDD$$anonfun$9.apply(HistogramRDD.scala:118)
at scala.collection.TraversableOnce$$anonfun$reduceLeft$1.apply(TraversableOnce.scala:190)
at scala.collection.TraversableOnce$$anonfun$reduceLeft$1.apply(TraversableOnce.scala:185)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1012)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

1 Answer


From my understanding you're trying to reduce the items by position in the arrays. You should consider zipping the arrays while reducing the RDD:

val a: RDD[Array[Int]] = ss.createDataset[Array[Int]](Seq(Array(1,2,3), Array(4,5,6))).rdd

    a.reduce { case (left: Array[Int], right: Array[Int]) =>
      // pair up the elements at the same index and sum each pair
      val zipped = left.zip(right)
      zipped.map { case (i1, i2) => i1 + i2 }
    }.foreach(println)

Outputs:

5
7
9
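
If the intermediate tuple array created by zip ever becomes a concern for very long arrays, a minimal sketch of an index-based variant (assuming all arrays have the same length) avoids that allocation:

    a.reduce { case (left, right) =>
      // assumes left.length == right.length; sums element-wise without building tuples
      val out = new Array[Int](left.length)
      var i = 0
      while (i < left.length) {
        out(i) = left(i) + right(i)
        i += 1
      }
      out
    }.foreach(println)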
– baitmbarek
  • I am new to Scala as well as Spark, hence a follow-up question: does the zip step add any additional processing overhead? When I run this code on a cluster of 12 machines with a dataset of around 23 GB, it takes hours to process, which seems way too long. It is this reduce function that takes the most time. – SGh Jul 01 '18 at 05:44
  • The zip method applies to iterables and creates a new iterable, so of course it adds processing and memory overhead, but nothing significant in most cases. How long are the arrays? How many cores do you have in your 12-machine cluster, and how many records are you processing? Do all records have the same size? You may try changing the number of partitions (see the sketch below). – baitmbarek Jul 01 '18 at 07:55
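
As a minimal sketch of that last suggestion, the RDD can be repartitioned before the reduce; the partition count here is purely illustrative and should be tuned to the cluster:

    // repartition before reducing; 48 is an illustrative value, not a recommendation
    val z = x.repartition(48).reduce { case (left, right) =>
      left.zip(right).map { case (i1, i2) => i1 + i2 }
    }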