
I am using Spark MLlib to generate predictions for my data and then write them to HDFS in Avro format:

val dataPredictions = myModel.transform(myData)
val output = dataPredictions.select("id", "probability", "prediction")
output.write.format("com.databricks.spark.avro").save(path)

I am getting the following Exception:

com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException:
    Unexpected type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.

My understanding is that the 'probability' column (a VectorUDT) cannot be serialized as Avro.

  • How do I convert a VectorUDT into an Array so that I can serialize it in Avro?
  • Are there any better alternatives (I can't move away from Avro format)?

1 Answer


To convert any Vector to an Array[Double] you can use the following UDF:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.linalg.Vector

val vectorToArrayUdf = udf((vector: Vector) => vector.toArray)

// Replace the VectorUDT column with an Array[Double] column before writing
val output = dataPredictions
    .withColumn("probabilities", vectorToArrayUdf(col("probability")))
    .select("id", "probabilities", "prediction")

output.write.format("com.databricks.spark.avro").save(path)
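If you are on Spark 3.0 or later, there is a built-in `vector_to_array` function in `org.apache.spark.ml.functions` that does the same conversion without a hand-rolled UDF. A sketch, reusing the `dataPredictions` and `path` names from the question (assuming Spark >= 3.0, where the Avro source is also bundled under the short name `"avro"`):

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// vector_to_array converts an ml (or mllib) Vector column to array<double>
val output = dataPredictions
    .withColumn("probabilities", vector_to_array(col("probability")))
    .select("id", "probabilities", "prediction")

// Since Spark 2.4 Avro support ships with Spark itself, so the external
// com.databricks.spark.avro package is no longer needed
output.write.format("avro").save(path)

The built-in function also benefits from Catalyst optimization, whereas a Scala UDF is a black box to the optimizer.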