
I have an RDD of type RDD[(Int,Double)], in which the first element of each pair is the index and the second is the value. I'd like to convert this RDD to a Vector to use for classification. Could someone help me with that?

I have the following code, but it's not working:

def vectorize(x: RDD[(Int, Double)], size: Int): Vector = {
  val vec = Vectors.sparse(size, x)
}
zero323
HHH

1 Answer


Since org.apache.spark.mllib.linalg.Vector is a local data structure, you have to collect your data to the driver first.

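import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// collect brings the (index, value) pairs to the driver as a local array
def vectorize(x: RDD[(Int, Double)], size: Int): Vector = {
  Vectors.sparse(size, x.collect)
}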

Since the resulting vector is not distributed, you have to be sure the output will fit in driver memory.

In general, this operation is not particularly useful. If your data can be easily handled using local data structures, then it probably shouldn't be stored in an RDD in the first place.
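As a rough end-to-end sketch of the approach above: the object name `VectorizeExample`, the local master, the app name, and the sample pairs are all illustrative assumptions, not part of the original question. It only shows that the pairs are collected to the driver and passed to `Vectors.sparse`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object VectorizeExample {
  // Collects the (index, value) pairs to the driver and builds a local sparse vector.
  // Only safe when the collected pairs fit in driver memory.
  def vectorize(x: RDD[(Int, Double)], size: Int): Vector =
    Vectors.sparse(size, x.collect().toSeq)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master and a made-up app name, for illustration only.
    val conf = new SparkConf().setMaster("local[*]").setAppName("vectorize-example")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: indices 0, 3 and 7 of a length-10 vector.
      val pairs = sc.parallelize(Seq((0, 1.0), (3, 2.5), (7, -4.0)))
      val vec = vectorize(pairs, size = 10)
      println(vec)
    } finally {
      sc.stop()
    }
  }
}
```

Every index in the RDD must be unique and smaller than `size`; positions that do not appear in the RDD are treated as zero in the resulting sparse vector.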

zero323
  • Is this the only way to do this conversion? – HHH Aug 05 '15 at 21:16
  • If you're asking about the `collect` part, then as long as you require a Vector as output, the answer is yes. – zero323 Aug 05 '15 at 21:23
  • No, I'm asking whether the method I have is the only way to convert an RDD to a Vector. – HHH Aug 05 '15 at 21:24
  • It's the only reasonable one. The output of `collect` already has the right type, so there is really nothing else to do here. `collect` is the only missing part in your code. – zero323 Aug 05 '15 at 21:27