
I have an RDD[(Long, Vector)] and I want to sum all the vectors element-wise. How can I achieve this in Spark 1.6?

For example, the input data looks like

 (1,[0.1,0.2,0.7])
 (2,[0.2,0.4,0.4])

It should then produce the result [0.3,0.6,1.1], ignoring the first (Long) value of each pair.

HappyCoding

2 Answers


If you have an RDD[(Long, Vector)] like this:

 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))

You can reduce the values (vectors) in order to get the sum:

 val res = myRdd
   .values
   .reduce { case (a: Vector, b: Vector) =>
     Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)) }

I get the following result, with a small floating-point error:

[0.30000000000000004,0.6000000000000001,1.1]
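A note on `zipped`, since it is asked about in the comments below: it pairs the two arrays element by element (without building an intermediate array of tuples, as `a.zip(b)` would), so `map(_ + _)` receives the two Doubles at each index. A minimal sketch in the plain Scala REPL:

 val a = Array(0.1, 0.2, 0.7)
 val b = Array(0.2, 0.4, 0.4)
 (a, b).zipped.map(_ + _)              // Array(0.30000000000000004, 0.6000000000000001, 1.1)
 a.zip(b).map { case (x, y) => x + y } // equivalent, but allocates the intermediate tuples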

source: this

drstein
  • May I know what "zipped" does? – HappyCoding Feb 25 '16 at 14:17
  • I can reproduce the given solution. However, when running it on my RDD, it shows the error "type mismatch; expected: (Vector, Vector) => Vector, actual: (Vector, Vector) => IndexedSeq[Any]". – HappyCoding Feb 25 '16 at 14:18
  • my data type is RDD[(Long, Vector)] – HappyCoding Feb 25 '16 at 14:19
  • are your vectors members of scala.collection.immutable.Vector? I suspect we're not using the same collection type. – drstein Feb 25 '16 at 14:25
  • just checked, the vector I used is from org.apache.spark.mllib.linalg, i.e., `sealed trait Vector extends Serializable` – HappyCoding Feb 25 '16 at 14:29
  • is this post related? http://stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors @vzamboni. I think the addition operation should be the one that works for Spark's vectors, right? (a Breeze-based sketch of that approach follows these comments) – HappyCoding Feb 25 '16 at 14:32
  • thanks! Following the insights, I just implemented a similar solution: `{case (a: Vector, b: Vector) => Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))}` – HappyCoding Feb 25 '16 at 15:01
  • great! glad it helped! I edited the answer with the final solution. – drstein Feb 25 '16 at 15:05
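For reference, the post linked in the comments above adds two MLlib vectors by converting them to Breeze vectors. A minimal sketch of that approach, assuming dense vectors (sparse input would need different handling):

 import breeze.linalg.{DenseVector => BDV}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 // Convert each MLlib vector to a Breeze DenseVector, add element-wise
 // with Breeze's + operator, then convert the result back.
 val sum: Vector = Vectors.dense(
   myRdd.values
     .map(v => BDV(v.toArray))
     .reduce(_ + _)
     .toArray)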

You can also refer to the Spark LDA example, which builds an RDD[(Long, Vector)] of the same shape:

 import org.apache.spark.ml.feature.CountVectorizerModel
 import org.apache.spark.ml.linalg.{Vector => MLVector}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.sql.Row

 val model = pipeline.fit(df)
 val documents = model.transform(df)
   .select("features")
   .rdd
   .map { case Row(features: MLVector) => Vectors.fromML(features) }
   .zipWithIndex()
   .map(_.swap)

 (documents,
   model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary, // vocabulary
   documents.map(_._2.numActives).sum().toLong)                   // total token count
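With `documents: RDD[(Long, Vector)]` in that shape, the element-wise sum from the accepted answer applies directly; a minimal sketch:

 val total = documents.values
   .reduce((a, b) => Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)))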