32

I am using Spark SQL (I mention that it is Spark in case it affects the SQL syntax; I'm not familiar enough with it yet to be sure) and I have a table that I am trying to restructure, but I'm getting stuck trying to transpose multiple columns at the same time.

Basically I have data that looks like:

userId    someString      varA     varB
   1      "example1"    [0,2,5]   [1,2,9]
   2      "example2"    [1,20,5]  [9,null,6]

and I'd like to explode both varA and varB simultaneously (the arrays will always have the same length), so that the final output looks like this:

userId    someString      varA     varB
   1      "example1"       0         1
   1      "example1"       2         2
   1      "example1"       5         9
   2      "example2"       1         9
   2      "example2"       20       null
   2      "example2"       5         6

but I can only seem to get a single explode(var) statement to work in one command, and if I try to chain them (i.e. create a temp table after the first explode command) then I obviously get a huge number of duplicate, unnecessary rows: the cross product of the two arrays.

Many thanks!

anthr

3 Answers

49

Spark >= 2.4

You can skip the zip udf and use the arrays_zip function:

df.withColumn("vars", explode(arrays_zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars.varA", $"vars.varB").show

Spark < 2.4

What you want is not possible without a custom UDF. In Scala you could do something like this:

val data = sc.parallelize(Seq(
    """{"userId": 1, "someString": "example1",
        "varA": [0, 2, 5], "varB": [1, 2, 9]}""",
    """{"userId": 2, "someString": "example2",
        "varA": [1, 20, 5], "varB": [9, null, 6]}"""
))

val df = spark.read.json(data)

df.printSchema
// root
//  |-- someString: string (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- varA: array (nullable = true)
//  |    |-- element: long (containsNull = true)
//  |-- varB: array (nullable = true)
//  |    |-- element: long (containsNull = true)

Now we can define a zip udf:

import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
   $"userId", $"someString",
   $"vars._1".alias("varA"), $"vars._2".alias("varB")).show

// +------+----------+----+----+
// |userId|someString|varA|varB|
// +------+----------+----+----+
// |     1|  example1|   0|   1|
// |     1|  example1|   2|   2|
// |     1|  example1|   5|   9|
// |     2|  example2|   1|   9|
// |     2|  example2|  20|null|
// |     2|  example2|   5|   6|
// +------+----------+----+----+

With raw SQL:

sqlContext.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.registerTempTable("df")

sqlContext.sql(
  """SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df""")
zero323
  • Can this be applied on 3 columns which are of type sequence? – Amit Kumar Feb 22 '17 at 18:06
  • @AmitKumar Yeah, why not? You'll have to adjust signature and body but it is not hard. – zero323 Feb 23 '17 at 05:59
  • I wonder if in the newer datasets API you could just use map and zip the arrays together without creating the UDF and whether it would be faster/ scale/ be optimised by the catalyst execution engine. I'll try it when at a console. – Davos Jul 21 '17 at 14:17
  • @zero323 can you please help me how to write the above UDF in java for more than three columns? – ROOT Jan 30 '18 at 08:28
  • @SatishKaruturi, Starting Spark 2.4.0 you no longer need to write your own UDF. There is a new `arrays_zip` function which can be applied to multiple columns. – haimco Sep 12 '18 at 13:47
  • To add to what @zero323 explained: if you have array columns of different lengths, you may want to use zipAll instead of zip (see the sketch after these comments). – Yeikel Jan 28 '19 at 20:19
  • @zero323 can you pls post Java version of the UDF function? Thanks. – Popeye Oct 04 '19 at 18:14
  • @zero323 is there a way I can use this udf to explode an array[int] and an array[struct] at the same time? I can't figure out how to specify array[struct] in your udf – Amazonian Jun 07 '20 at 07:34
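Following up on the zipAll suggestion in the comments: zip truncates to the length of the shorter array, while zipAll pads the shorter one. A hedged sketch of a padding variant (the zipAll name and null padding are my choice; boxed java.lang.Long keeps the padding nullable):

import org.apache.spark.sql.functions.udf

// pad the shorter array with null instead of truncating
val zipAll = udf((xs: Seq[java.lang.Long], ys: Seq[java.lang.Long]) =>
  xs.zipAll(ys, null, null))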
1

You could also try a typed Dataset with flatMap:

case class Input(
  userId: Integer,
  someString: String,
  varA: Array[Integer],
  varB: Array[Integer])

case class Result(
  userId: Integer,
  someString: String,
  varA: Integer,
  varB: Integer)

def getResult(row: Input): Iterable[Result] = {
  val userId = row.userId
  val someString = row.someString
  val varA = row.varA
  val varB = row.varB
  // emit one Result per index (assumes varA and varB have the same length)
  for (i <- varA.indices) yield Result(userId, someString, varA(i), varB(i))
}

val obj1 = Input(1, "string1", Array(0, 2, 5), Array(1, 2, 9))
val obj2 = Input(2, "string2", Array(1, 3, 6), Array(2, 3, 10))
val input_df = sc.parallelize(Seq(obj1, obj2)).toDS

val res = input_df.flatMap{ row => getResult(row) }
res.show
// +------+----------+----+----+
// |userId|someString|varA|varB|
// +------+----------+----+----+
// |     1|   string1|   0|   1|
// |     1|   string1|   2|   2|
// |     1|   string1|   5|   9|
// |     2|   string2|   1|   2|
// |     2|   string2|   3|   3|
// |     2|   string2|   6|  10|
// +------+----------+----+----+
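The same transformation can be written inline without the helper function. A minimal sketch, assuming the Input and Result case classes above and spark.implicits._ in scope:

val res2 = input_df.flatMap { row =>
  // zip pairs elements by index and truncates to the shorter array,
  // whereas the indexed loop above assumes equal lengths
  row.varA.zip(row.varB).map { case (a, b) =>
    Result(row.userId, row.someString, a, b)
  }
}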
0

This approach works with any number of array columns; here is an example with four:

case class Input(
  user_id: Integer,
  someString: String,
  varA: Array[Integer],
  varB: Array[Integer],
  varC: Array[String],
  varD: Array[String])

val obj1 = Input(1, "example1", Array(0,2,5), Array(1,2,9), Array("a","b","c"), Array("red","green","yellow"))
val obj2 = Input(2, "example2", Array(1,20,5), Array(9,null,6), Array("d","e","f"), Array("white","black","cyan"))
val obj3 = Input(3, "example3", Array(10,11,12), Array(5,8,7), Array("g","h","i"), Array("blue","pink","brown"))

val input_df = sc.parallelize(Seq(obj1, obj2, obj3)).toDS
input_df.show()

import org.apache.spark.sql.functions.{explode, udf}

// declare input types that match the column types; boxed Integer keeps null elements intact
val zip = udf((a: Seq[Integer], b: Seq[Integer], c: Seq[String], d: Seq[String]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i))))

val output_df = input_df
  .withColumn("vars", explode(zip($"varA", $"varB", $"varC", $"varD")))
  .select($"user_id", $"someString",
          $"vars._1".alias("varA"), $"vars._2".alias("varB"),
          $"vars._3".alias("varC"), $"vars._4".alias("varD"))
output_df.show()
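On Spark >= 2.4 the same result no longer needs a UDF (as the comments on the accepted answer note): arrays_zip handles any number of array columns and names the struct fields after the input columns. A minimal sketch against the same input_df:

import org.apache.spark.sql.functions.{arrays_zip, explode}

input_df
  .withColumn("vars", explode(arrays_zip($"varA", $"varB", $"varC", $"varD")))
  .select($"user_id", $"someString",
          $"vars.varA", $"vars.varB", $"vars.varC", $"vars.varD")
  .show()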
Aman Mundra