I am working in the Scala Spark Shell and have the following RDD:

scala> docsWithFeatures
res10: org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[162] at repartition at <console>:9

I previously saved this to text using:

docsWithFeatures.saveAsTextFile("path/to/file")

Here's an example line from the text file (which I've shortened here for readability):

(22246418,(112312,[4,11,14,15,19,...],[109.0,37.0,1.0,3.0,600.0,...]))

Now, I know I could have saved this as object file to simplify things, but the raw text format is better for my purposes.

My question is, what is the proper way to get this text file back into an RDD of the same format as above (i.e. an RDD of (integer, sparse vector) tuples)? I'm assuming I just need to load it with sc.textFile and then apply a couple of mapping functions, but I'm very new to Scala and not sure how to go about it.


1 Answer
A simple regular expression and built-in vector utilities should do the trick:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def parse(rdd: RDD[String]): RDD[(Long, Vector)] = {
  val pattern: scala.util.matching.Regex = "\\(([0-9]+),(.*)\\)".r
  rdd.map {
    case pattern(k, v) => (k.toLong, Vectors.parse(v))
  }
}

Example usage:

val docsWithFeatures = sc.parallelize(Seq(
  "(22246418,(4,[1],[2.0]))", "(312332123,(3,[0,2],[-1.0,1.0]))"))

parse(docsWithFeatures).collect
// Array[(Long, org.apache.spark.mllib.linalg.Vector)] =
//   Array((22246418,(4,[1],[2.0])), (312332123,(3,[0,2],[-1.0,1.0])))
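If you want to check what the regex captures without a Spark shell, here is a plain-Scala sketch (no MLlib needed). One caveat worth knowing: the `case pattern(k, v) =>` match inside `parse` throws a `MatchError` on any line that doesn't fit the pattern, so if the file might contain malformed lines, the partial-function overload `rdd.collect { case pattern(k, v) => ... }` is a safer variant that silently skips them.

```scala
// Plain-Scala demonstration of the regex used in parse().
// A Scala Regex used as an extractor must match the entire string:
// group 1 captures the numeric key, group 2 the vector's string form.
val pattern = "\\(([0-9]+),(.*)\\)".r

val line = "(22246418,(4,[1],[2.0]))"
val parsed = line match {
  case pattern(k, v) => (k.toLong, v)
}
// parsed._1 == 22246418L          (the document id)
// parsed._2 == "(4,[1],[2.0])"    (what Vectors.parse would then consume)
```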