
I have a VertexRDD[DenseVector[Double]] and I want to convert it to a DataFrame, but I don't understand how to map the values from the DenseVector to new columns in the DataFrame.

I am trying to specify the schema as:

val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into an RDD[Row], so that I can finally create a DataFrame like:

val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = sqlContext.createDataFrame(myRDD, schema)

But I get:

scala.MatchError: 20502 (of class java.lang.Long)

Any hint is more than welcome.


1 Answer


One way to handle this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}

// Build one Row per vertex: the vertex id followed by the vector's elements
val rows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id +: v.toArray)
}

// Declare a schema whose types match the actual values:
// LongType for the id and DoubleType for the seven properties
val schema = StructType(
  StructField("id", LongType, false) +:
  (1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))

val df = sqlContext.createDataFrame(rows, schema)
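
To sanity-check the result you can inspect the schema and a few rows; printSchema and show are standard DataFrame methods, and the column names below are the ones declared in the schema above:

df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- prop1: double (nullable = false)
//  ...
//  |-- prop7: double (nullable = false)

df.show(5)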

Notes:

  • declared types have to match the actual types; you cannot declare a string column and then pass a long or a double (if you really want string columns, see the sketch after this list)
  • the structure of each Row has to match the declared structure; in your case you're trying to create a row with a Long and a Vector[Double], but you declare 8 columns
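
If you really do want the all-string schema from the question, the same pattern works as long as the values are converted to strings before building the rows. A minimal sketch, assuming myvertexRDD is an RDD[(Long, breeze.linalg.DenseVector[Double])] with 7-element vectors and sqlContext is available as above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Convert the id and every vector element to String so they match the StringType columns
val stringRows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id.toString +: v.toArray.map(_.toString))
}

val stringSchema = StructType(
  StructField("id", StringType, false) +:
  (1 to 7).map(i => StructField(s"prop$i", StringType, false)))

val stringDF = sqlContext.createDataFrame(stringRows, stringSchema)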