Spark UDF to operate on an array

Question

I have a Spark DataFrame like:

+-------------+------------------------------------------+
|a            |destination                               |
+-------------+------------------------------------------+
|[a,Alice,1]  |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]]                |
|[c,Charlie,0]|[[b,Bob,0]]                               |
|[b,Bob,0]    |[[c,Charlie,0]]                           |
|[f,Fanny,0]  |[[c,Charlie,0], [h,Fraudster,1]]          |
|[d,David,0]  |[[a,Alice,1], [e,Esther,0]]               |
+-------------+------------------------------------------+

with a schema of

|-- destination: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- var_only_0_and_1: integer (nullable = false)

How can I construct a UDF that operates on the column destination, i.e. the wrapped array created by Spark's collect_list, to calculate the mean of the variable var_only_0_and_1?

Why the down vote? I know about explode, but would prefer a solution like http://faculty.ucmerced.edu/frusu/Papers/Conference/2017-hpdc-array-udf.pdf — Georg Heiler, Nov 29 '17 at 19:17
And falling back to https://stackoverflow.com/questions/42931796/spark-udf-for-structtype-row will interfere with Tungsten's optimizations — Georg Heiler, Nov 29 '17 at 19:18

score 5 · Answer 1 · answered Nov 30 '17 at 12:36

You can operate directly on the array as long as you get the method signature of the UDF correct (something that has hit me hard in the past). An array column is passed to a Scala UDF as a Seq, and a struct as a Row, so you'll need something like this:

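import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def test(in: Seq[Row]): Int = {
  // return a named field from the second struct in the array (index 1, zero-based);
  // the field is an integer in the schema, and this throws if the array has fewer than two elements
  in(1).getAs[Int]("var_only_0_and_1")
}

val udftest = udf(test _)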

I've tested this on data looking like yours. I'm guessing it's possible to iterate over the elements of the Seq[Row] in order to achieve what you want.
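As a minimal sketch of that iteration (not something I have run against your exact data): the names meanOf, meanFlag and meanFlagUdf are made up for illustration, and the field name is taken from the schema in the question. It assumes the field is an Int, and returns an Option so that a null or empty array gives null rather than NaN.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Mean of a sequence of 0/1 flags; None for an empty sequence so the
// UDF yields null instead of NaN. (Hypothetical helper name.)
def meanOf(flags: Seq[Int]): Option[Double] =
  if (flags.isEmpty) None else Some(flags.sum.toDouble / flags.size)

// Spark hands an array<struct> column to a Scala UDF as Seq[Row].
// The column and its elements are nullable, so guard against both.
def meanFlag(in: Seq[Row]): Option[Double] =
  Option(in).flatMap { rows =>
    meanOf(rows.filter(_ != null).map(_.getAs[Int]("var_only_0_and_1")))
  }

val meanFlagUdf = udf(meanFlag _)

// Hypothetical usage, assuming `df` is the DataFrame from the question:
// df.withColumn("mean_flag", meanFlagUdf(df("destination")))
```

Keeping the arithmetic in meanOf, separate from the Row handling, makes the numeric part easy to check without a Spark session.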

To be honest, I'm not at all sure about the type safety of doing this, and I believe that explode is the preferable way to do it, as per @ayplam. Built-in functions will generally be faster than any UDF that a dev provides, as Spark cannot optimise a UDF.

score 0 · Answer 2 · answered Nov 29 '17 at 18:12

You can use native Spark SQL functions for this.

import org.apache.spark.sql.functions.{avg, col, explode}

df.withColumn("dest", explode(col("destination")))
  .groupBy("a").agg(avg(col("dest").getField("var_only_0_and_1")))

ayplam

But explode doesn't really look efficient. Is there a way to operate directly on the array? – Georg Heiler Nov 29 '17 at 19:08
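
One way to work on the array in place without explode or a UDF, offered as a sketch rather than a tested answer: Spark 2.4 and later ship SQL higher-order functions such as aggregate, which can be reached from Scala through expr. The name meanFlagExpr is made up for illustration, and the column and field names are assumed to match the schema in the question. How null or empty arrays come out depends on the Spark version and settings (for example ANSI mode), so that case would need checking separately.

```scala
import org.apache.spark.sql.functions.expr

// SQL expression: sum the flags inside the array, then divide by its length.
// `aggregate` and the lambda syntax need Spark 2.4 or later.
val meanFlagExpr = expr(
  "aggregate(destination, 0, (acc, x) -> acc + x.var_only_0_and_1) / size(destination)"
)

// Hypothetical usage, assuming `df` is the DataFrame from the question:
// df.withColumn("mean_flag", meanFlagExpr)
```

Because this stays inside Catalyst expressions, Spark can still optimise it, unlike an opaque UDF.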
