Zip 2 columns in spark

Question

Data Frame structure:

|     main_id|                  id|           createdBy|
+------------+--------------------+--------------------+
|1           |          [10,20,30]|       [999,888,777]|
|2           |                [30]|               [666]|

Expected Data Frame structure:

|     main_id|                  id|           createdBy|
+------------+--------------------+--------------------+
|1           |                  10|                 999|
|1           |                  20|                 888|
|1           |                  30|                 777|
|2           |                  30|                 666|

Code_1 Tried:

 df.select($"main_id",explode($"id"),$"createdBy").select($"main_id",$"id",explode($"createdBy"))

This pairs the elements incorrectly and also produces duplicates. Any suggestions on what I should tweak to get the required output?

I also tried using multiple explodes in the first select statement, which throws errors.
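
To illustrate the wrong pairing and duplicates described above, here is a minimal plain-Scala sketch (no Spark required). It assumes the array elements are integers, which the question does not state, and the object name `ExplodeVsZip` is made up for the example. Exploding one array and then the other behaves like a nested loop over both arrays, whereas the expected output needs elements paired by position.

```scala
// Plain Scala collections standing in for the row with main_id = 1.
// Element type Int is an assumption; the question does not give the schema.
object ExplodeVsZip {
  val id: Seq[Int] = Seq(10, 20, 30)
  val createdBy: Seq[Int] = Seq(999, 888, 777)

  // Two chained explodes: every id is combined with every createdBy.
  val chained: Seq[(Int, Int)] = for (i <- id; c <- createdBy) yield (i, c)

  // Desired behaviour: elements paired by their position in the arrays.
  val zipped: Seq[(Int, Int)] = id.zip(createdBy)

  def main(args: Array[String]): Unit = {
    println(s"chained explodes: ${chained.size} rows")
    println(s"zip by position:  ${zipped.size} rows")
    zipped.foreach(println)
  }
}
```

For this row the nested loop yields nine pairs, while the positional zip yields the three pairs shown in the expected output.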

Code_2 Tried:

import org.apache.spark.sql.functions.{udf, explode}
val zip = udf((xs: Seq[String], ys: Seq[String]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"id", $"createdBy"))).select(
  $"main_id",
  $"vars._1".alias("varA"), $"vars._2".alias("varB")).show(1)

Warning and Error:

warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 564.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 564.0 (TID 11570, ma4-csxp-ldn1015.corp.apple.com, executor 288)

Yes, I asked the same question before; it was closed as a duplicate pointing to another solution, which is what I tried in snippet 2. It didn't work either. Any suggestions would be really helpful.

Based on your comments, it seems you are experiencing dependency problems unrelated to the code provided in the question. You can start [here](https://stackoverflow.com/q/39953245/8371915) and, if that doesn't resolve your problem, please ask another question providing a [mcve] (minimal code, versions of dependencies, submission method, cluster manager). - Alper t. Turker, Jun 15 '18 at 11:31

score 1 · Accepted Answer · edited Jun 15 '18 at 10:36

Perhaps the following can help:

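import org.apache.spark.sql.functions.{explode, monotonically_increasing_id}

// Explode each array column separately, keeping main_id alongside.
val x = someDF.withColumn("createdByExploded", explode(someDF("createdBy"))).select("createdByExploded", "main_id")
val y = someDF.withColumn("idExploded", explode(someDF("id"))).select("idExploded", "main_id")

// Tag every row with a generated id. These ids are only guaranteed to be unique and increasing, so x and y are not guaranteed to receive matching values.
val xInd = x.withColumn("index", monotonically_increasing_id())
val yInd = y.withColumn("index", monotonically_increasing_id())

// Join the two exploded frames on the generated id, then drop the helper column.
val joined = xInd.join(yInd, xInd("index") === yInd("index"), "outer").drop("index")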

https://forums.databricks.com/questions/8180/how-to-merge-two-data-frames-column-wise-in-apache.html
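
As a possible alternative that avoids the join altogether, here is a sketch built on `posexplode`. This is a different technique from the answer above, not a restatement of it. It explodes `id` together with each element's position and then reads the `createdBy` element at the same position. The names `ZipColumnsSketch`, `zipExplode`, `pos` and `id_elem` are made up for the example. The integer column types and the local `SparkSession` are assumptions, and the sketch assumes both arrays in a row have the same length.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, posexplode}

object ZipColumnsSketch {
  // Explode `id` with its position, then look up createdBy at that position.
  // Assumes id and createdBy have the same length within each row.
  def zipExplode(df: DataFrame): DataFrame =
    df.select(col("main_id"), posexplode(col("id")).as(Seq("pos", "id_elem")), col("createdBy"))
      .select(col("main_id"), col("id_elem").as("id"), expr("createdBy[pos]").as("createdBy"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder().master("local[1]").appName("zip-columns-sketch").getOrCreate()
    import spark.implicits._

    // Sample data mirroring the question; Int element types are an assumption.
    val df = Seq(
      (1, Seq(10, 20, 30), Seq(999, 888, 777)),
      (2, Seq(30), Seq(666))
    ).toDF("main_id", "id", "createdBy")

    zipExplode(df).show()
    spark.stop()
  }
}
```

Because the pairing happens inside each row, it does not depend on generated ids lining up across two separately exploded DataFrames.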
