
When I try to use a UDF that returns a Vector object, Spark throws the following exception:

Cause: java.lang.UnsupportedOperationException: Not supported DataType: org.apache.spark.mllib.linalg.VectorUDT@f71b0bce

How can I use Vector in my UDFs? The Spark version is 1.5.1.

UPD

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, udf}

val dataFrame: DataFrame = sqlContext.createDataFrame(Seq(
  (0, 1, 2),
  (0, 3, 4),
  (0, 5, 6)
)).toDF("key", "a", "b")

// UDF that packs two columns into a dense mllib Vector
val someUdf = udf {
  (a: Double, b: Double) => Vectors.dense(a, b)
}

// Fails with the UnsupportedOperationException above
dataFrame.groupBy(col("key"))
  .agg(someUdf(avg("a"), avg("b")))

1 Answer


There is nothing wrong with your UDF per se. It looks like you get the exception because you call it inside the `agg` method on aggregated columns. To make it work, simply move it outside the `agg` step:

// assumes import sqlContext.implicits._ for the $"..." syntax
dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))
  .select($"key", someUdf($"a", $"b"))
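
For reference, a quick way to sanity-check the result (a sketch reusing the `dataFrame` and `someUdf` definitions from the question; the `vec` alias is mine):

val result = dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))
  .select($"key", someUdf($"a", $"b").alias("vec"))

result.show()
// With the sample data all rows share key 0, so this should print a
// single row with vec = [3.0,4.0] (avg("a") = 3.0, avg("b") = 4.0)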
  • Thank you for your reply. The same code works if I change `Vectors.dense()` to e.g. `Array()`. – Zyoma Oct 07 '15 at 16:50
  • I know. It looks like the problem is specific to the combination of `agg` and `VectorUDT` computed columns. – zero323 Oct 07 '15 at 16:54
  • Your example works for me. Thanks again. But I think this behavior is odd. `someUdf` works fine in the `agg` method if I return some primitive type or e.g. an `Array`. Can someone explain why this happens? – Zyoma Oct 07 '15 at 16:56
  • The problem with `Vector` is that it is not a native Spark SQL type. It is implemented as a user-defined type (hence `VectorUDT`) with a fairly complex representation. I guess nobody anticipated a use case like this :) Still, I have to admit it is confusing. – zero323 Oct 07 '15 at 17:01
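
As a side note, a similar result can be had without a UDF at all, using `VectorAssembler` from `spark.ml` (available since Spark 1.4). A rough sketch, assuming the aggregated columns `a` and `b` from the answer above:

import org.apache.spark.ml.feature.VectorAssembler

// Aggregate first, then let VectorAssembler build the vector column
val aggregated = dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("vec")

assembler.transform(aggregated).select($"key", $"vec")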