I am working in the Scala Spark Shell and have the following RDD:

scala> docsWithFeatures
res10: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[162] at repartition at <console>:9

I previously saved this to text using:

docsWithFeatures.saveAsTextFile("path/to/file")

Here's an example line from the text file (which I've shortened here for readability):

(22246418,(112312,[4,11,14,15,19,...],[109.0,37.0,1.0,3.0,600.0,...]))

Now, I know I could have saved this as object file to simplify things, but the raw text format is better for my purposes.

My question is, what is the proper way to get this text file back into an RDD of the same format as above (i.e. an RDD of (integer, sparse vector) tuples)? I'm assuming I just need to load it with sc.textFile and then apply a couple of mapping functions, but I'm very new to Scala and not sure how to go about it.


1 Answer
A simple regular expression and built-in vector utilities should do the trick:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def parse(rdd: RDD[String]): RDD[(Long, Vector)] = {
  val pattern: scala.util.matching.Regex = "\\(([0-9]+),(.*)\\)".r
  rdd.map {
    case pattern(k, v) => (k.toLong, Vectors.parse(v))
  }
}

Example usage:

val docsWithFeatures = sc.parallelize(Seq(
  "(22246418,(4,[1],[2.0]))", "(312332123,(3,[0,2],[-1.0,1.0]))"))

parse(docsWithFeatures).collect
// Array[(Long, org.apache.spark.mllib.linalg.Vector)] =
//   Array((22246418,(4,[1],[2.0])), (312332123,(3,[0,2],[-1.0,1.0])))
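If you want to check what the regex captures without a Spark shell, here is a plain-Scala sketch (no MLlib needed). One caveat worth knowing: the `case pattern(k, v) =>` match inside `parse` throws a `MatchError` on any line that doesn't fit the pattern, so if the file might contain malformed lines, the partial-function overload `rdd.collect { case pattern(k, v) => ... }` is a safer variant that silently skips them.

```scala
// Plain-Scala demonstration of the regex used in parse().
// A Scala Regex used as an extractor must match the entire string:
// group 1 captures the numeric key, group 2 the vector's string form.
val pattern = "\\(([0-9]+),(.*)\\)".r

val line = "(22246418,(4,[1],[2.0]))"
val parsed = line match {
  case pattern(k, v) => (k.toLong, v)
}
// parsed._1 == 22246418L          (the document id)
// parsed._2 == "(4,[1],[2.0])"    (what Vectors.parse would then consume)
```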