
I have an RDD in the following format and would like to convert it into an RDD of LabeledPoint so that I can process it with MLlib:

Test: RDD[(Int, Seq[Double])] = Array((1,List(1.0,3.0,8.0)),(2,List(3.0,3.0,8.0)),(1,List(2.0,3.0,7.0)),(1,List(5.0,5.0,9.0)))

I tried using map:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
Test.map(x=> LabeledPoint(x._1, Vectors.sparse(x._2)))

but I get this error:

mllib.linalg.Vector cannot be applied to (Seq[scala.Double])

So presumably the Seq needs to be converted first, but I don't know into what.

zero323
ulrich

2 Answers


There are a few problems here:

  • The label should be a Double, not an Int.
  • Vectors.sparse requires the vector size, the indices and the values.
  • None of the vector constructors accepts a List of Double.
  • Your data looks dense, not sparse.
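
To make the second and third points concrete, here is a minimal sketch of the two factory methods on Vectors. The value names (features, dense, sparse, point) are only illustrative, and the snippet assumes spark-mllib is on the classpath; it does not need a SparkContext.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Illustrative input: one row's features as a Seq[Double]
val features = Seq(1.0, 3.0, 8.0)

// Dense vector: built from an Array[Double], so the Seq is converted first
val dense = Vectors.dense(features.toArray)

// Sparse vector: needs the size, the indices of the stored entries, and their values
val sparse = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 3.0, 8.0))

// The label is passed as a Double
val point = LabeledPoint(1.toDouble, dense)
```

Since every entry here is non-zero, the sparse form stores as many values as the dense one plus the indices, which is why a dense vector is the more natural fit for this data.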

One possible solution:

val rdd = sc.parallelize(Array(
    (1, List(1.0,3.0,8.0)),
    (2, List(3.0, 3.0,8.0)),
    (1, List(2.0,3.0,7.0)),
    (1, List(5.0,5.0,9.0))))

rdd.map { case (k, vs) => 
  LabeledPoint(k.toDouble, Vectors.dense(vs.toArray))
}

and another:

rdd.collect { case (k, v::vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(v, vs: _*)) }
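
One thing to be aware of with the second variant: collect with a partial function keeps only the elements the pattern matches, so a row with an empty feature list is dropped rather than raising an error. A small sketch, assuming a local master and made-up sample rows (rows, points and the app name are illustrative, not from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Reuses the running context (e.g. in spark-shell) or creates a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[1]").setAppName("collect-sketch"))

// Hypothetical data: the second row has no features
val rows = sc.parallelize(Seq(
  (1, List(1.0, 3.0, 8.0)),
  (2, List.empty[Double])))

// v :: vs only matches non-empty lists, so the second row is filtered out
val points = rows.collect { case (k, v :: vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(v, vs: _*))
}

val kept = points.count()
```

If empty feature lists should be treated as an error instead, the map version with vs.toArray is the safer choice to build on.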
zero323

As LabeledPoint's documentation shows, its constructor takes a Double as the label and a Vector (a DenseVector or a SparseVector) as the features. The constructors of both of these subclasses take arrays rather than a Seq, so you need to convert your Seq to an Array first.

import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint

val rdd = sc.parallelize(Array((1, Seq(1.0,3.0,8.0)), 
                               (2, Seq(3.0, 3.0,8.0)),
                               (1, Seq(2.0,3.0, 7.0)),
                               (1, Seq(5.0, 5.0, 9.0))))
val x = rdd.map{
    case (a: Int, b:Seq[Double]) => LabeledPoint(a, new DenseVector(b.toArray))
}

x.take(2).foreach(println)

//(1.0,[1.0,3.0,8.0])
//(2.0,[3.0,3.0,8.0])
Alberto Bonsanto