
When I try to use a UDF that returns a Vector object, Spark throws the following exception:

Cause: java.lang.UnsupportedOperationException: Not supported DataType: org.apache.spark.mllib.linalg.VectorUDT@f71b0bce

How can I use Vector in my UDFs? The Spark version is 1.5.1.

UPD

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, udf}

val dataFrame: DataFrame = sqlContext.createDataFrame(Seq(
  (0, 1, 2),
  (0, 3, 4),
  (0, 5, 6)
)).toDF("key", "a", "b")

// UDF that packs two columns into a dense mllib Vector
val someUdf = udf {
  (a: Double, b: Double) => Vectors.dense(a, b)
}

// Fails with the UnsupportedOperationException above
dataFrame.groupBy(col("key"))
  .agg(someUdf(avg("a"), avg("b")))

1 Answer


There is nothing wrong with your UDF per se. It looks like you get the exception because you call it inside the `agg` method on aggregated columns. To make it work, simply move it outside the `agg` step:

// assumes import sqlContext.implicits._ for the $"..." syntax
dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))
  .select($"key", someUdf($"a", $"b"))
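
For reference, a quick way to sanity-check the result (a sketch reusing the `dataFrame` and `someUdf` definitions from the question; the `vec` alias is mine):

val result = dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))
  .select($"key", someUdf($"a", $"b").alias("vec"))

result.show()
// With the sample data all rows share key 0, so this should print a
// single row with vec = [3.0,4.0] (avg("a") = 3.0, avg("b") = 4.0)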
  • Thank you for your reply. The same code works if I change `Vectors.dense()` to e.g. `Array()`. – Zyoma Oct 07 '15 at 16:50
  • I know. It looks like the problem is specific to the combination of `agg` and `VectorUDT` computed columns. – zero323 Oct 07 '15 at 16:54
  • Your example works for me. Thanks again. But I think this behavior is odd. `someUdf` works fine in the `agg` method if I return some primitive type or e.g. an `Array`. Can someone explain why this happens? – Zyoma Oct 07 '15 at 16:56
  • The problem with `Vector` is that it is not a native Spark SQL type. It is implemented as a user-defined type (hence `VectorUDT`) with a fairly complex representation. I guess nobody anticipated a use case like this :) Still, I have to admit it is confusing. – zero323 Oct 07 '15 at 17:01
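
As a side note, a similar result can be had without a UDF at all, using `VectorAssembler` from `spark.ml` (available since Spark 1.4). A rough sketch, assuming the aggregated columns `a` and `b` from the answer above:

import org.apache.spark.ml.feature.VectorAssembler

// Aggregate first, then let VectorAssembler build the vector column
val aggregated = dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("vec")

assembler.transform(aggregated).select($"key", $"vec")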