
I would like to read from a huge CSV file and assign every row to a vector by splitting its values on ",". In the end I aim to have an RDD of Vectors holding the values. However, I get an error after Seq:

type mismatch; found : Unit required: org.apache.spark.mllib.linalg.Vector Error occurred in an application involving default arguments.

My code is like this so far:

val file = "/data.csv"
val data: RDD[Vector] = sc.parallelize(
  Seq(
    for (line <- Source.fromFile(file).getLines) {
      Vectors.dense(line.split(",").map(_.toDouble).distinct)
    }
  )
)
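The error comes from the `for` loop itself: without `yield`, a Scala for comprehension evaluates to `Unit`, so the `Seq(...)` wraps a single `Unit` value rather than your vectors. A minimal plain-Scala sketch (using hypothetical in-memory lines in place of the file) shows the difference:

```scala
// Hypothetical stand-in for the CSV contents
val lines = Seq("1.0,2.0,3.0", "4.0,5.0,6.0")

// Without `yield` the whole for comprehension evaluates to Unit,
// which is what triggers the "found: Unit" type mismatch:
val bad: Unit = for (line <- lines) { line.split(",").map(_.toDouble) }

// With `yield` it produces one mapped value per line:
val good: Seq[Array[Double]] = for (line <- lines) yield line.split(",").map(_.toDouble)
```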
Tolga

1 Answer


You should read it using SparkContext's textFile API:

val file = "/data.csv"
val data = sc.textFile(file).map(line => Vectors.dense(line.split(",").map(_.toDouble).distinct))

This gives you an org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector].

But if you are looking for RDD[Vector[Double]] (Scala's immutable Vector), then you can simply do

val file = "/data.csv"
val data = sc.textFile(file).map(line => line.split(",").map(_.toDouble).distinct.toVector)
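Note that `.distinct` in both snippets removes duplicate values within a row, which may not be what you want for numeric records. A plain-Scala sketch of the per-line transformation (no Spark needed, with a made-up line) illustrates the effect:

```scala
// Hypothetical CSV line containing a duplicate value
val line = "1.0,2.0,2.0,3.0"

// Same per-line transformation as in the map above;
// distinct silently drops the second 2.0
val v: Vector[Double] = line.split(",").map(_.toDouble).distinct.toVector
// v == Vector(1.0, 2.0, 3.0)
```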
Ramesh Maharjan
  • With this approach, does it assign each line to a vector in the RDD? – Tolga Mar 24 '18 at 12:50
  • I tried the first one and it actually works. However, when I want to measure the correlation between vectors, I realized I must transpose the matrix, since each vector holds a record, not attributes. Is it possible to transpose? – Tolga Mar 24 '18 at 13:42
  • When an answer works for you and is helpful in reaching your next step, you should consider accepting the answer (and upvoting too), and then ask another question for the next level of errors. – Ramesh Maharjan Mar 24 '18 at 13:45
  • But before you ask a question, do thorough research and try a lot yourself. – Ramesh Maharjan Mar 24 '18 at 13:47
  • Sorry, I did not know that people do not answer next question without accept. Thanks for the information, I accepted it. – Tolga Mar 24 '18 at 13:47