Data Frame structure:

 |     main_id|                  id|           createdBy|
 +------------+--------------------+--------------------+
 |           1|          [10,20,30]|       [999,888,777]|
 |           2|                [30]|               [666]|

Expected Data Frame structure:

 |     main_id|                  id|           createdBy|
 +------------+--------------------+--------------------+
 |           1|                  10|                 999|
 |           1|                  20|                 888|
 |           1|                  30|                 777|
 |           2|                  30|                 666|

Code_1 Tried:

 df.select($"main_id",explode($"id"),$"createdBy").select($"main_id",$"id",explode($"createdBy"))

This pairs the values incorrectly and also produces duplicate rows. Any suggestions on what I should change to get the required output?

I also tried using multiple explode calls in the first select statement, but that throws errors.
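
For illustration, the mis-pairing from chaining two explode calls can be reproduced with plain Scala collections, without Spark. This is only a sketch: the `Row` case class and both helper names are hypothetical, and the element type is assumed to be `Int`. Exploding `id` first leaves the full `createdBy` array on every resulting row, so the second explode combines every `id` with every `createdBy` value, whereas zipping first pairs the elements by position.

```scala
object ExplodeVsZip {
  // One input row, modelled with plain Scala values (hypothetical type, no Spark needed).
  final case class Row(mainId: Int, id: Seq[Int], createdBy: Seq[Int])

  // Mimics explode(id) followed by explode(createdBy):
  // every id is combined with every createdBy value (a cross product).
  def chainedExplode(r: Row): Seq[(Int, Int, Int)] =
    for {
      i <- r.id
      c <- r.createdBy
    } yield (r.mainId, i, c)

  // Mimics zipping the two arrays and exploding once:
  // elements are paired by their position in the arrays.
  def zipThenExplode(r: Row): Seq[(Int, Int, Int)] =
    r.id.zip(r.createdBy).map { case (i, c) => (r.mainId, i, c) }

  def main(args: Array[String]): Unit = {
    val row = Row(1, Seq(10, 20, 30), Seq(999, 888, 777))
    println(s"chained explode rows: ${chainedExplode(row).size}")
    println(s"zip then explode rows: ${zipThenExplode(row).size}")
  }
}
```

For the first sample row, the chained version yields 3 x 3 = 9 combinations, while the zipped version yields the 3 expected pairs.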

Code_2 Tried:

import org.apache.spark.sql.functions.{udf, explode}
val zip = udf((xs: Seq[String], ys: Seq[String]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"id", $"createdBy"))).select(
$"main_id",
$"vars._1".alias("varA"), $"vars._2".alias("varB")).show(1)

Warning and Error:

warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 564.0 failed 4 times, most recent failure: Lost task 0.3 in 
stage 564.0 (TID 11570, ma4-csxp-ldn1015.corp.apple.com, executor 288)

Yes, I asked this question before; it was closed as a duplicate pointing to another solution, which is what I tried in snippet 2. That did not work either. Any suggestions would be really helpful.

data_person
Based on your comments, it seems you are experiencing dependency problems unrelated to the code provided in the question. You can start [here](https://stackoverflow.com/q/39953245/8371915) and, if that doesn't resolve your problem, please ask another question providing a [mcve] (minimal code, versions of dependencies, submission method, cluster manager) – Alper t. Turker Jun 15 '18 at 11:31

1 Answer

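Perhaps the following can help. It explodes each array column separately, tags the rows with an index, and joins the two results on that index:

import org.apache.spark.sql.functions.{explode, monotonically_increasing_id}

// Explode each array column on its own, keeping main_id alongside.
val x = someDF.withColumn("createdByExploded", explode(someDF("createdBy"))).select("createdByExploded", "main_id")
val y = someDF.withColumn("idExploded", explode(someDF("id"))).select("idExploded", "main_id")

// Caveat: monotonically_increasing_id() yields unique, increasing IDs that are not consecutive,
// so the two indexes only line up if both frames are partitioned identically.
val xInd = x.withColumn("index", monotonically_increasing_id())
val yInd = y.withColumn("index", monotonically_increasing_id())

val joined = xInd.join(yInd, xInd("index") === yInd("index"), "outer").drop("index")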

https://forums.databricks.com/questions/8180/how-to-merge-two-data-frames-column-wise-in-apache.html
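
As a variation on the same explode-and-join idea, here is a minimal sketch that uses `posexplode` so that each exploded element carries its position within its own array, and then joins on `main_id` plus that position instead of a generated ID. The object name, the `explodePaired` helper, the `pos` column name, and the sample data (with `Int` elements) are assumptions made for illustration; only the column names `main_id`, `id`, and `createdBy` come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, posexplode}

object PairArraysByPosition {
  // Explode both array columns with their element positions, then join the
  // two results on (main_id, pos) so that elements are paired by position.
  def explodePaired(df: DataFrame): DataFrame = {
    val ids = df.select(col("main_id"), posexplode(col("id")).as(Seq("pos", "id")))
    val creators = df.select(col("main_id"), posexplode(col("createdBy")).as(Seq("pos", "createdBy")))
    // Inner join: positions present in only one of the two arrays are dropped.
    ids.join(creators, Seq("main_id", "pos")).drop("pos")
  }

  def main(args: Array[String]): Unit = {
    // Local session for a quick check; a cluster job would use its own session.
    val spark = SparkSession.builder().master("local[*]").appName("pair-arrays").getOrCreate()
    import spark.implicits._

    // Sample data mirroring the question's input (element type assumed to be Int).
    val df = Seq(
      (1, Seq(10, 20, 30), Seq(999, 888, 777)),
      (2, Seq(30), Seq(666))
    ).toDF("main_id", "id", "createdBy")

    explodePaired(df).orderBy("main_id", "id").show()
    spark.stop()
  }
}
```

Because the join key is derived from the data itself, this does not depend on how the rows are partitioned. It does assume that `main_id` is unique per input row and that both arrays in a row have the same length.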

aysegulpekel