
I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrame and RDD.

Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas:

def inputSchema = new StructType().add("features", new VectorUDT())
def bufferSchema: StructType =
    StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)

override def dataType: DataType = ArrayType(DoubleType,true) 

VectorUDT is what I would use with spark.mllib.linalg.Vector: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

However, when I try to import it from spark.ml instead (import org.apache.spark.ml.linalg.VectorUDT), I get a runtime error (no errors during the build):

class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg 

Is this expected, and can you suggest a workaround?

I am using Spark 2.0.0.

1 Answer


In Spark 2.0.0, the right approach is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.
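As a rough sketch of how the schemas from the question might look with this change, assuming Spark 2.x with spark-mllib on the classpath: the holder object VectorSchemas is hypothetical and only exists to keep the snippet self-contained; in an actual UDAF these would be members of the class extending UserDefinedAggregateFunction.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType, StructField, StructType}

// Hypothetical holder object (not from the question); in a real UDAF these
// definitions would live in the class extending UserDefinedAggregateFunction.
object VectorSchemas {
  // SQLDataTypes.VectorType is the public DataType for ml.linalg.Vector columns,
  // used here in place of `new VectorUDT()`.
  val inputSchema: StructType =
    new StructType().add("features", SQLDataTypes.VectorType)

  // Buffer holding an array of vectors, mirroring the question's bufferSchema.
  val bufferSchema: StructType = StructType(
    StructField(
      "list_of_similarities",
      ArrayType(SQLDataTypes.VectorType, containsNull = true),
      nullable = true) :: Nil)

  // Output type is unchanged from the question: an array of doubles.
  val dataType: DataType = ArrayType(DoubleType, containsNull = true)
}
```

Only the import and the two places that previously called new VectorUDT() change; the rest of the UDAF should not need to know which class backs the type.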

nedim
  • How did you find this? I have worked with the Spark codebase for four years but found this puzzling. – WestCoastProjects May 08 '18 at 23:08
  • As a note: the source code has this `/** * User-defined type for [[Vector]] in [[mllib-local]] which allows easy interaction with SQL * via [[org.apache.spark.sql.Dataset]]. */ private[spark] class VectorUDT extends UserDefinedType[Vector] {` So there is no mention of the `VectorType` there .. – WestCoastProjects May 08 '18 at 23:09
  • I can't really remember how I found it, but I do remember that it took a long time. – nedim May 09 '18 at 11:23