I have some Mahout vectors stored in HDFS in sequence file format. Is it possible to use the same vectors to train a KMeans model in Spark? I could convert the existing Mahout vectors into Spark (MLlib) vectors, but I'd like to avoid that.
Mahout vectors are not directly supported by Spark. As you suspected, you would need to convert them to Spark (MLlib) vectors.
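import org.apache.hadoop.io.NullWritable
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import scala.collection.JavaConverters._
val sc = new SparkContext("local[2]", "MahoutTest")
// Mahout stores vectors as VectorWritable values; the key class must match the one used to write the file.
val sfData = sc.sequenceFile(dir, classOf[NullWritable], classOf[VectorWritable])
val xformedVectors = sfData.map { case (label, vect) =>
  // Unwrap the Mahout vector and copy every element into a dense MLlib vector.
  (label, Vectors.dense(vect.get.all.iterator.asScala.map(_.get).toArray))
}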
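Once the vectors are converted, training is a short step. The following is only a minimal sketch, assuming the MLlib RDD-based API (`KMeans.train`); the object name `MahoutKMeansSketch`, the helper `trainKMeans`, the values of `k` and `maxIterations`, and the four sample points are all hypothetical stand-ins for your own data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object MahoutKMeansSketch {
  // Hypothetical helper: trains MLlib KMeans on vectors already converted from Mahout.
  def trainKMeans(vectors: RDD[Vector], k: Int, maxIterations: Int): KMeansModel = {
    vectors.cache() // KMeans makes several passes over the data
    KMeans.train(vectors, k, maxIterations)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MahoutKMeansSketch")
    val sc = SparkContext.getOrCreate(conf)
    // Stand-in data; with the conversion above this would be xformedVectors.map(_._2).
    val vectors = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
    val model = trainKMeans(vectors, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println) // one line per learned center
    sc.stop()
  }
}
```

Dropping the sequence file key before training keeps the RDD free of Hadoop `Writable` objects, which are not Java-serializable and can cause trouble if the data is cached or shuffled.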

WestCoastProjects
This is really not so bad: a single distributed pass over the DRM is fast. When using the Spark-Mahout code, there is no need for the sequence file either. – pferrel Feb 08 '15 at 16:18