
For a single column, the input type of a UDF is the Scala type corresponding to that column's dataType, while for a struct column the input type is Row. Why is that, and how is it implemented?

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) :: StructField("col2", StringType, true) :: Nil)
val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
val rd = sc.parallelize(data)
val df = spark.createDataFrame(rd, schema)
df.printSchema

root
 |-- subtable: struct (nullable = true)
 |    |-- col1: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |    |-- col2: string (nullable = true)

// the struct column arrives in the UDF as a Row; this udf overload takes the return DataType explicitly
val u = udf((x: Row) => x, sub_schema)
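
To make the Row input concrete, the UDF can be applied to the struct column. The sketch below is illustrative and assumes Spark 2.x; u, df, and subtable come from the snippet above, while the describe UDF and the output column names are made up for the example.

import org.apache.spark.sql.functions.col

// inside the lambda the struct is a Row carrying its schema, so fields can be read by name
val describe = udf((r: Row) =>
  s"${r.getAs[String]("col2")} has ${r.getAs[Seq[Int]]("col1").length} element(s)")

df.select(
  u(col("subtable")).as("copied"),          // identity UDF, returns the struct unchanged
  describe(col("subtable")).as("summary")   // derived string column
).show(false)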

Spark UDF for StructType / Row

Kamel
  • Could you explain why you find it unusual? What else would you expect? – 10465355 Nov 14 '18 at 13:14
  • Because the input type of a UDF is determined by convention; it is implemented using reflection and cannot be inferred by static type checking. – Kamel Nov 15 '18 at 01:47
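
The convention mentioned in the last comment can be illustrated with Spark's ScalaReflection helper (an internal Catalyst API, so this sketch is only indicative and may vary between versions): ordinary Scala types are mapped to Catalyst DataTypes by reflection on their TypeTag, whereas Row carries no compile-time schema, which is presumably why a struct argument is handed to the UDF as a generic Row and why the return DataType has to be supplied explicitly.

// illustrative only: ScalaReflection is an internal Spark API
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.Row
import scala.util.Try

// an ordinary Scala type has a Catalyst DataType derivable by reflection
ScalaReflection.schemaFor[Seq[Int]]   // e.g. Schema(ArrayType(IntegerType,false),true)

// Row has no type-level schema, so reflection cannot produce a DataType for it;
// hence udf((x: Row) => x, sub_schema) must be given the schema by hand
Try(ScalaReflection.schemaFor[Row])   // Failure(UnsupportedOperationException(...))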

0 Answers