How would I convert the following DataFrame

val df = Seq(
  (5.0, 1.0, 1.0, 3.0, 7.0),
  (2.0, 0.0, 3.0, 4.0, 5.0),
  (4.0, 0.0, 0.0, 6.0, 7.0)).toDF("m1", "m2", "m3", "m4", "m5")
//df: res166: org.apache.spark.sql.DataFrame = [m1: int, m2: int ... 3 more fields]

to an Array of dense vectors

val arrayDenseVectors = Array(
      Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
//arrayDenseVectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,1.0,1.0,3.0,7.0], [2.0,0.0,3.0,4.0,5.0], [4.0,0.0,0.0,6.0,7.0])

To further complicate the issue, the df columns are of type `Int` instead of `Double`.

Xavier Guihot
Amazonian

1 Answer

Using map on the underlying RDD, you can convert each row into a Vector and then collect the results into an array:

import org.apache.spark.mllib.linalg.Vectors

val arrayDenseVectors = df.rdd.map { r =>
  // getAs[Number] handles both Int and Double columns, and
  // iterating up to r.length covers all five columns (m1..m5)
  Vectors.dense((0 until r.length).map(r.getAs[Number](_).doubleValue).toArray)
}.collect

//arrayDenseVectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,1.0,1.0,3.0,7.0], [2.0,0.0,3.0,4.0,5.0], [4.0,0.0,0.0,6.0,7.0])
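The Int-to-Double widening at the heart of the question can be sketched without Spark at all: `Vectors.dense` ultimately just needs an `Array[Double]` per row, so each row of Ints is mapped element-wise with `toDouble` (the `DenseRows`/`toDense` names below are illustrative, not part of any Spark API):

```scala
// Illustrative sketch: widen rows of Ints to the Array[Double]
// shape that Vectors.dense expects, one array per row.
object DenseRows {
  def toDense(rows: Seq[Seq[Int]]): Array[Array[Double]] =
    rows.map(_.map(_.toDouble).toArray).toArray

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Seq(5, 1, 1, 3, 7),
      Seq(2, 0, 3, 4, 5),
      Seq(4, 0, 0, 6, 7))
    // Prints each widened row, e.g. [5.0,1.0,1.0,3.0,7.0]
    toDense(rows).foreach(r => println(r.mkString("[", ",", "]")))
  }
}
```

In the Spark version, `r.getAs[Number](i).doubleValue` plays the same role as `toDouble` here, but also tolerates columns that are already Double.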
blackbishop