DataFrame structure:
+-------+------------+---------------+
|main_id|          id|      createdBy|
+-------+------------+---------------+
|      1|[10, 20, 30]|[999, 888, 777]|
|      2|        [30]|          [666]|
+-------+------------+---------------+
Expected DataFrame structure:
+-------+---+---------+
|main_id| id|createdBy|
+-------+---+---------+
|      1| 10|      999|
|      1| 20|      888|
|      1| 30|      777|
|      2| 30|      666|
+-------+---+---------+
Code_1 Tried:
df.select($"main_id",explode($"id"),$"createdBy").select($"main_id",$"id",explode($"createdBy"))
This produces wrong pairings as well as duplicate rows. Any suggestions on what I should tweak to get the required output?
I also tried using multiple explodes in the first select statement, but that throws errors.
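To make the wrong pairing concrete, here is a minimal sketch in plain Scala collections (no Spark), using the values from row main_id = 1. It assumes, as the duplicates suggest, that two chained explodes combine every element of one array with every element of the other, while the expected output pairs elements by position. The names crossProduct and zipped are only illustrative.

```scala
// Values taken from row main_id = 1 of the table above.
val ids = Seq(10, 20, 30)
val createdBy = Seq(999, 888, 777)

// Two chained explodes act like a nested loop:
// every id is combined with every createdBy value (3 x 3 = 9 rows).
val crossProduct = for (i <- ids; c <- createdBy) yield (i, c)

// Pairing by position keeps only (10,999), (20,888), (30,777),
// which matches the expected DataFrame.
val zipped = ids.zip(createdBy)
```

So the arrays need to be zipped first and exploded once, rather than exploded one after the other.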
Code_2 Tried:
import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[String], ys: Seq[String]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"id", $"createdBy")))
  .select($"main_id", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
  .show(1)
Warning and Error:
warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 564.0 failed 4 times, most recent failure: Lost task 0.3 in
stage 564.0 (TID 11570, ma4-csxp-ldn1015.corp.apple.com, executor 288)
Yes, I have asked this question before. It was closed as a duplicate pointing to another solution, which is what I tried in Code_2. That did not work either. Any suggestions would be really helpful.
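For reference, a possible direction that avoids the UDF entirely is sketched below. It assumes Spark 2.4 or later, where the built-in arrays_zip function is available, and it assumes the array elements are integers; neither is confirmed above. The session setup and the toy data are there only to make the sketch self-contained. Since the stack trace above is truncated, it is not clear whether this would avoid the same failure on the real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("zip-explode-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data mirroring the table above (integer elements are an assumption).
val df = Seq(
  (1, Seq(10, 20, 30), Seq(999, 888, 777)),
  (2, Seq(30), Seq(666))
).toDF("main_id", "id", "createdBy")

// arrays_zip pairs the two arrays by position into an array of structs,
// so a single explode gives one row per (id, createdBy) pair.
val result = df
  .select(col("main_id"), explode(arrays_zip(col("id"), col("createdBy"))).as("vars"))
  .select(col("main_id"), col("vars.*"))
  // Rename positionally so the sketch does not depend on the struct field names.
  .toDF("main_id", "id", "createdBy")

result.show()
```

Zipping before exploding keeps the positional pairing, so no cross product and no duplicates are produced for this toy input.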