
For a single column, the input type of a UDF is the Scala type corresponding to that column's dataType, while for a struct column the input type is Row. Why is that, and how is it implemented?

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) :: StructField("col2", StringType, true) :: Nil)
val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
val rd = sc.parallelize(data)
val df = spark.createDataFrame(rd, schema)
df.printSchema

root
 |-- subtable: struct (nullable = true)
 |    |-- col1: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |    |-- col2: string (nullable = true)

// the struct column arrives in the UDF as a Row; this udf overload takes the return DataType explicitly
val u = udf((x: Row) => x, sub_schema)
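
To make the Row input concrete, the UDF can be applied to the struct column. The sketch below is illustrative and assumes Spark 2.x; u, df, and subtable come from the snippet above, while the describe UDF and the output column names are made up for the example.

import org.apache.spark.sql.functions.col

// inside the lambda the struct is a Row carrying its schema, so fields can be read by name
val describe = udf((r: Row) =>
  s"${r.getAs[String]("col2")} has ${r.getAs[Seq[Int]]("col1").length} element(s)")

df.select(
  u(col("subtable")).as("copied"),          // identity UDF, returns the struct unchanged
  describe(col("subtable")).as("summary")   // derived string column
).show(false)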

Spark UDF for StructType / Row

Kamel
  • Could you explain why you find it unusual? What else would you expect? – 10465355 Nov 14 '18 at 13:14
  • Because the input type of a UDF is determined by convention; it is implemented using reflection and cannot be inferred by static type checking. – Kamel Nov 15 '18 at 01:47
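
The convention mentioned in the last comment can be illustrated with Spark's ScalaReflection helper (an internal Catalyst API, so this sketch is only indicative and may vary between versions): ordinary Scala types are mapped to Catalyst DataTypes by reflection on their TypeTag, whereas Row carries no compile-time schema, which is presumably why a struct argument is handed to the UDF as a generic Row and why the return DataType has to be supplied explicitly.

// illustrative only: ScalaReflection is an internal Spark API
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.Row
import scala.util.Try

// an ordinary Scala type has a Catalyst DataType derivable by reflection
ScalaReflection.schemaFor[Seq[Int]]   // e.g. Schema(ArrayType(IntegerType,false),true)

// Row has no type-level schema, so reflection cannot produce a DataType for it;
// hence udf((x: Row) => x, sub_schema) must be given the schema by hand
Try(ScalaReflection.schemaFor[Row])   // Failure(UnsupportedOperationException(...))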

0 Answers