I have some Mahout vectors stored in HDFS in sequence file format. Is it possible to use these vectors directly to train a KMeans model in Spark? I could convert the existing Mahout vectors into Spark (MLlib) vectors, but I'd like to avoid that.

zero323
IrishDog

1 Answer

Spark does not support Mahout vectors directly. As you suspected, you would need to convert them to Spark (MLlib) vectors:

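import org.apache.hadoop.io.NullWritable
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import scala.collection.JavaConverters._
val sc = new SparkContext("local[2]", "MahoutTest")
// dir is the HDFS path of the sequence files; values are assumed to be VectorWritable
val sfData = sc.sequenceFile[NullWritable, VectorWritable](dir)
// Copy every element of each Mahout vector into a dense MLlib vector
val xformedVectors = sfData.map { case (label, vect) =>
  (label, Vectors.dense(vect.get.all.asScala.map(_.get).toArray))
}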
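Once the vectors are in MLlib form, training is a single call. The following is only a minimal sketch, not part of the original answer: the object name `KMeansSketch`, the four in-memory points, and the values `k = 2` and 20 iterations are all illustrative assumptions. With the real data you would pass something like `xformedVectors.map(_._2)` instead of the stand-in RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Trains a KMeans model on a small stand-in data set (hypothetical values).
  def train(sc: SparkContext): KMeansModel = {
    // Stand-in for xformedVectors.map(_._2): two well-separated groups of points
    val vectors = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )).cache()
    // k = 2 clusters, at most 20 iterations (both chosen for illustration)
    KMeans.train(vectors, 2, 20)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KMeansSketch")
    val sc = new SparkContext(conf)
    val model = train(sc)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```

Caching the input RDD is worthwhile here because KMeans makes several passes over the data.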
WestCoastProjects
  • This is really not so bad: a single distributed pass over the DRM (distributed row matrix) is fast. When using the Spark-Mahout code, there is no need for the sequence file either. – pferrel Feb 08 '15 at 16:18