Spark MLlib RowMatrix from SparseVector

Question

I am trying to create a RowMatrix from an RDD of SparseVectors but am getting the following error:

<console>:37: error: type mismatch;
 found   : dataRows.type (with underlying type org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector])
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
Note: org.apache.spark.mllib.linalg.SparseVector <: org.apache.spark.mllib.linalg.Vector (and dataRows.type <: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.SparseVector]), but class RDD is invariant in type T.
You may wish to define T as +T instead. (SLS 4.5)
       val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)

My code is:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}

val DATA_FILE_DIR = "/user/cloudera/data/"
val DATA_FILE_NAME = "dataOct.txt"

val dataRows = sc.textFile(DATA_FILE_DIR.concat(DATA_FILE_NAME)).map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse)

val svd = new RowMatrix(dataRows.persist()).computeSVD(20, computeU = true)

My input data file is approximately 150 rows by 50,000 columns of space-separated integers.

I am running:

Spark: Version 1.5.0-cdh5.5.1

Java: 1.7.0_67

score 1 · Accepted Answer · answered Feb 06 '16 at 11:20

Just provide an explicit type annotation, either for the RDD:

val dataRows: org.apache.spark.rdd.RDD[Vector] = ???

or for the result of the anonymous function:

...
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)).toSparse: Vector)
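
To see why either annotation helps, here is a minimal sketch in plain Scala that reproduces the same situation without Spark. All names in it (Vec, SparseVec, Box, rowCount, InvarianceDemo) are hypothetical stand-ins, not MLlib classes: Box plays the role of the invariant RDD, and rowCount plays the role of the RowMatrix constructor that requires the wider element type.

```scala
// Hypothetical stand-ins for the Spark types; these are NOT the real MLlib classes.
trait Vec
final case class SparseVec(size: Int) extends Vec

// Invariant in T, like RDD[T]: Box[SparseVec] is not a subtype of Box[Vec].
final class Box[T](val items: Seq[T]) {
  def map[U](f: T => U): Box[U] = new Box(items.map(f))
}

object InvarianceDemo {
  // Plays the role of RowMatrix's constructor, which requires RDD[Vector].
  def rowCount(rows: Box[Vec]): Int = rows.items.length

  def main(args: Array[String]): Unit = {
    val lines = new Box(Seq("1 0 2", "0 0 3"))

    // Without any annotation the result is inferred as Box[SparseVec],
    // and rowCount(lines.map(...)) stored in a val first would not compile.

    // Option 1: annotate the val, so U is taken from the expected type Box[Vec].
    val annotated: Box[Vec] = lines.map(line => SparseVec(line.split(" ").length))

    // Option 2: ascribe the lambda's result, widening each element to Vec.
    val ascribed = lines.map(line => SparseVec(line.split(" ").length): Vec)

    println(rowCount(annotated))
    println(rowCount(ascribed))
  }
}
```

In both options the element type is fixed as the supertype at the point where the collection is built, so no conversion from Box[SparseVec] to Box[Vec] is ever needed. The same reasoning should carry over to RDD[SparseVector] and RDD[Vector].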

zero323
