
I have an RDD of the form RDD[((ID, code), value)]

Example RDD:

((00001, 234), 7.0)
((00001, 456), 6.0)
((00001, 467), 3.0)
((00002, 245), 8.0)
((00002, 765), 9.0)
...

The expected result: RDD[(String, Vector)], where each vector is built with Vectors.dense(...)

Example:

(00001, vector(7.0, 6.0, 3.0))
(00002, vector(8.0, 9.0))

I have tried the following:

val vectRDD = InRDD.groupBy(f => f._1._1)
  .map(m => (m._1, Vectors.dense(m._2._2)))

But I get the following error:

value _2 is not a member of Iterable

Suggestions?

Al.

1 Answer


You're almost there; you're just missing an inner map over the 2nd tuple element to assemble the DenseVector:

import org.apache.spark.ml.linalg.Vectors

val rdd = sc.parallelize(Seq(
  (("00001", 234), 7.0),
  (("00001", 456), 6.0),
  (("00001", 467), 3.0),
  (("00002", 245), 8.0),
  (("00002", 765), 9.0)
))

rdd.
  groupBy(_._1._1).                                         // group by ID: (ID, Iterable[((ID, code), value)])
  map(t => (t._1, Vectors.dense(t._2.map(_._2).toArray))).  // pull out the values and build the vector
  collect
// res1: Array[(String, org.apache.spark.ml.linalg.Vector)] =
//   Array((00001,[7.0,6.0,3.0]), (00002,[8.0,9.0]))

Note that Vectors.dense takes an Array[Double], hence the toArray.
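
A rough alternative sketch, assuming the same sample rdd as above: drop the code column before grouping, so only the plain Doubles are carried through the shuffle:

rdd.
  map { case ((id, _), value) => (id, value) }.  // keep only (ID, value)
  groupByKey.                                    // (ID, Iterable[Double])
  mapValues(vs => Vectors.dense(vs.toArray)).
  collect

As with groupBy, the order of values inside each vector is just the iteration order of the group, so don't rely on it if the code column implies a fixed position.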

Leo C
  • Leo C., Thanks again! I figured it out and had not looked here, but my code looks ugly compared to yours. One thing, I didn't use collect. – Al. Oct 01 '18 at 21:04
  • Looks like I need more of something to vote on things here. Tried to give an upvote on this one and the other you helped me with. – Al. Oct 01 '18 at 21:06
  • @Al., I ran `collect` only to show the result of the minuscule sample data. Normally you wouldn't perform `collect` unless you want to return all data to the driver program node (see the sketch after these comments). To learn more about `collect`, here's a [SO link](https://stackoverflow.com/questions/44174747/spark-dataframe-collect-vs-select). – Leo C Oct 01 '18 at 21:45
  • @Leo C, thanks for the link! I had a suspicion that collect had to do with collections. Oh, and I can upvote now :) – Al. Oct 01 '18 at 22:05
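
A minimal sketch of peeking at a couple of rows without returning the whole result to the driver; vectRDD here is a hypothetical binding of the answer's transformation:

// Hypothetical binding of the answer's result, for illustration only
val vectRDD = rdd.
  groupBy(_._1._1).
  map(t => (t._1, Vectors.dense(t._2.map(_._2).toArray)))

// take(n) returns only the first n elements to the driver,
// unlike collect, which returns the entire RDD
vectRDD.take(2).foreach(println)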