
What's the best way to generalize the conversion from RDD[Vector] to DataFrame with Scala/Spark 1.6? The inputs are different RDD[Vector]s, and the number of columns in the Vector can range from 1 to n for each RDD.

I tried using the shapeless library, but it requires the number and types of the columns to be declared up front. For example:

val df = rddVector.map(_.toArray.toList)
  .collect {
    case t: List[Double] if t.length == 3 =>
      t.toHList[Double :: Double :: Double :: HNil].get.tupled.productArity
  }
  .toDF("column_1", "column_2", "column_3")

Thanks!

Arturo Gatto
  • From what I understand, I answered something similar here: https://stackoverflow.com/a/45009516/7224597 Can you check if that works for you? – philantrovert Aug 01 '17 at 08:42

1 Answer


This worked for me.

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

  // Create a vector RDD with rows of different lengths
  val vectorRDD = sc.parallelize(Seq(Seq(123L, 345L), Seq(567L, 789L), Seq(567L, 789L, 233334L)))
    .map(s => Vectors.dense(s.map(_.toDouble).toArray))

  // The longest vector determines how many columns the schema needs
  val vectorLength = vectorRDD.map(x => x.toArray.length).max()

  // Build the schema dynamically: one nullable DoubleType column per position
  var schema = new StructType()
  var i = 0
  while (i < vectorLength) {
    schema = schema.add(StructField(s"val${i}", DoubleType, true))
    i = i + 1
  }

  // Build a row RDD in which every row has the same arity:
  // copy each vector into a zero-initialized array of the maximum length
  val rowRDD = vectorRDD.map { x =>
    val row = new Array[Double](vectorLength)
    val newRow = x.toArray
    System.arraycopy(newRow, 0, row, 0, newRow.length)
    Row.fromSeq(row)
  }

  // Create the DataFrame
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
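
The schema and rows shown under Output below can be printed with the standard DataFrame inspection methods; these calls are assumed here, they were not part of the original snippet:

  // Print the inferred schema and the padded rows
  dataFrame.printSchema()
  dataFrame.show()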

Output:

 root
 |-- val0: double (nullable = true)
 |-- val1: double (nullable = true)
 |-- val2: double (nullable = true)

+-----+-----+--------+
| val0| val1|    val2|
+-----+-----+--------+
|123.0|345.0|     0.0|
|567.0|789.0|     0.0|
|567.0|789.0|233334.0|
+-----+-----+--------+
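
Since the goal is a fully generic conversion for any RDD[Vector], the same steps can also be wrapped in a reusable helper. This is a minimal sketch assuming Spark 1.6 MLlib Vectors and an existing SQLContext; the name vectorRDDToDF is purely illustrative:

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, Row, SQLContext}
  import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

  // Hypothetical helper: converts any RDD[Vector] to a DataFrame,
  // padding shorter vectors with 0.0 so every row has the same arity.
  def vectorRDDToDF(vectorRDD: RDD[Vector], sqlContext: SQLContext): DataFrame = {
    // The widest vector determines the number of columns
    val vectorLength = vectorRDD.map(_.size).max()

    // One nullable DoubleType column per position: val0, val1, ...
    val schema = StructType((0 until vectorLength).map(i =>
      StructField(s"val$i", DoubleType, nullable = true)))

    // Pad each vector to the maximum length and wrap it in a Row
    val rowRDD = vectorRDD.map(v => Row.fromSeq(v.toArray.padTo(vectorLength, 0.0).toSeq))

    sqlContext.createDataFrame(rowRDD, schema)
  }

  // Usage:
  // val df = vectorRDDToDF(vectorRDD, sqlContext)
  // df.show()

Missing trailing values are padded with 0.0, exactly as in the code above; if zero is a meaningful value in your data, consider padding with Double.NaN instead so padding can be distinguished from real zeros.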
Sanchit Grover
  • Thanks, but in this solution you must create a fixed schema. I don't know the schema; the schema is variable. My Spark version is 1.6, not 2.0. – Arturo Gatto Aug 01 '17 at 13:44
  • I have updated the answer to accommodate the condition you provided. It's not the neatest solution, but it will work :) – Sanchit Grover Aug 02 '17 at 05:59