I have a Spark DataFrame like:
+-------------+------------------------------------------+
|a            |destination                               |
+-------------+------------------------------------------+
|[a,Alice,1]  |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]]                |
|[c,Charlie,0]|[[b,Bob,0]]                               |
|[b,Bob,0]    |[[c,Charlie,0]]                           |
|[f,Fanny,0]  |[[c,Charlie,0], [h,Fraudster,1]]          |
|[d,David,0]  |[[a,Alice,1], [e,Esther,0]]               |
+-------------+------------------------------------------+
with a schema of
|-- destination: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- id: string (nullable = true)
|    |    |-- name: string (nullable = true)
|    |    |-- var_only_0_and_1: integer (nullable = false)
How can I construct a UDF that operates on the column destination, i.e. the wrapped array created by Spark's collect_list function, to calculate the mean of the variable var_only_0_and_1?