
I have a DataFrame with two array-valued columns, like below:

import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val ds = Seq((Array("a","b"), Array("1","2")), (Array("p","q"), Array("3","4")))
val df = ds.toDF("col1", "col2")

+------+------+
|  col1|  col2|
+------+------+
|[a, b]|[1, 2]|
|[p, q]|[3, 4]|
+------+------+

I want to transform this into an array of pairs, like below:

+------+------+---------------+
|  col1|  col2|           col3|
+------+------+---------------+
|[a, b]|[1, 2]|[[a, 1],[b, 2]]|
|[p, q]|[3, 4]|[[p, 3],[q, 4]]|
+------+------+---------------+

I guess I could use struct and then some UDF, but I wanted to know if there is any built-in higher-order function to do this efficiently.

Roy

2 Answers


From Spark 2.4 onwards, use the arrays_zip function.

Example:

df.show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|[a, b]|[1, 2]|
#|[p, q]|[3, 4]|
#+------+------+
from pyspark.sql.functions import arrays_zip, col

df.withColumn("col3", arrays_zip(col("col1"), col("col2"))).show()
#+------+------+----------------+
#|  col1|  col2|            col3|
#+------+------+----------------+
#|[a, b]|[1, 2]|[[a, 1], [b, 2]]|
#|[p, q]|[3, 4]|[[p, 3], [q, 4]]|
#+------+------+----------------+
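
The Scala equivalent, matching the question's setup, would be along these lines (a minimal sketch; arrays_zip lives in org.apache.spark.sql.functions from Spark 2.4 onwards, and the variable name `zipped` is just for illustration):

import org.apache.spark.sql.functions.arrays_zip

// Zip the two array columns element-wise into an array of structs
val zipped = df.withColumn("col3", arrays_zip(df("col1"), df("col2")))
zipped.show()
// +------+------+----------------+
// |  col1|  col2|            col3|
// +------+------+----------------+
// |[a, b]|[1, 2]|[[a, 1], [b, 2]]|
// |[p, q]|[3, 4]|[[p, 3], [q, 4]]|
// +------+------+----------------+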
notNull

For Spark 2.3 or below, I found the iterator zip method really handy for this use case (something I was unaware of when posting the question). I can define a small UDF:

import org.apache.spark.sql.functions.udf
val zip = udf((xs: Seq[String], ys: Seq[String]) => xs.zip(ys))

and use it as:

val out = df.withColumn("col3", zip(df("col1"), df("col2")))

This gives me the desired result.
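
For reference, Spark encodes the Scala tuples returned by zip as structs with fields _1 and _2, so the schema of the new column should look roughly like this (a sketch, not verified output):

out.printSchema()
// root
//  |-- col1: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- col2: array (nullable = true)
//  |    |-- element: string (containsNull = true)
//  |-- col3: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- _1: string (nullable = true)
//  |    |    |-- _2: string (nullable = true)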

Roy