
I have an RDD[(Long, Vector)] and I want to sum all the vectors element-wise. How can I achieve this in Spark 1.6?

For example, the input data looks like

 (1,[0.1,0.2,0.7])
 (2,[0.2,0.4,0.4])

It should then produce the result [0.3,0.6,1.1], ignoring the first (Long) value of each pair.

HappyCoding

2 Answers


If you have an RDD[(Long, Vector)] like this:

 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))

You can reduce the values (vectors) in order to get the sum:

 val res = myRdd
   .values
   .reduce { case (a: Vector, b: Vector) =>
     Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)) }

I get the following result, with a small floating-point error:

[0.30000000000000004,0.6000000000000001,1.1]
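A note on `zipped`, since it is asked about in the comments below: it pairs the two arrays element by element (without building an intermediate array of tuples, as `a.zip(b)` would), so `map(_ + _)` receives the two Doubles at each index. A minimal sketch in the plain Scala REPL:

 val a = Array(0.1, 0.2, 0.7)
 val b = Array(0.2, 0.4, 0.4)
 (a, b).zipped.map(_ + _)              // Array(0.30000000000000004, 0.6000000000000001, 1.1)
 a.zip(b).map { case (x, y) => x + y } // equivalent, but allocates the intermediate tuples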

source: this

drstein
  • May I know what "zipped" does? – HappyCoding Feb 25 '16 at 14:17
  • I can reproduce the given solution. However, when running it on my RDD, it shows the error "type mismatch; expected: (Vector, Vector) => Vector, actual: (Vector, Vector) => IndexedSeq[Any]". – HappyCoding Feb 25 '16 at 14:18
  • my data type is RDD[(Long, Vector)] – HappyCoding Feb 25 '16 at 14:19
  • are your vectors members of scala.collection.immutable.Vector? I suspect we're not using the same collection type. – drstein Feb 25 '16 at 14:25
  • just checked, the vector I used is from org.apache.spark.mllib.linalg, i.e., `sealed trait Vector extends Serializable` – HappyCoding Feb 25 '16 at 14:29
  • is this post related? http://stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors @vzamboni. I think the addition operation should be the one that works for Spark's vectors, right? (a Breeze-based sketch of that approach follows these comments) – HappyCoding Feb 25 '16 at 14:32
  • thanks! Following the insights, I just implemented a similar solution: `{case (a: Vector, b: Vector) => Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))}` – HappyCoding Feb 25 '16 at 15:01
  • great! glad it helped! I edited the answer with the final solution. – drstein Feb 25 '16 at 15:05
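For reference, the post linked in the comments above adds two MLlib vectors by converting them to Breeze vectors. A minimal sketch of that approach, assuming dense vectors (sparse input would need different handling):

 import breeze.linalg.{DenseVector => BDV}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 // Convert each MLlib vector to a Breeze DenseVector, add element-wise
 // with Breeze's + operator, then convert the result back.
 val sum: Vector = Vectors.dense(
   myRdd.values
     .map(v => BDV(v.toArray))
     .reduce(_ + _)
     .toArray)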

You can also refer to the Spark LDA example, which builds an RDD[(Long, Vector)] of the same shape:

 import org.apache.spark.ml.feature.CountVectorizerModel
 import org.apache.spark.ml.linalg.{Vector => MLVector}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.sql.Row

 val model = pipeline.fit(df)
 val documents = model.transform(df)
   .select("features")
   .rdd
   .map { case Row(features: MLVector) => Vectors.fromML(features) }
   .zipWithIndex()
   .map(_.swap)

 (documents,
   model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary, // vocabulary
   documents.map(_._2.numActives).sum().toLong)                   // total token count
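With `documents: RDD[(Long, Vector)]` in that shape, the element-wise sum from the accepted answer applies directly; a minimal sketch:

 val total = documents.values
   .reduce((a, b) => Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)))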