
I have the following DataFrame with some columns that contain arrays (we are using Spark 1.6):

+--------------------+--------------+------------------+--------------+--------------------+-------------+
|            UserName|     col1     |    col2          |col3          |col4                |col5         |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo                 |[Main, Indi...|[1777203, 1777203]|    [GBP, GBP]|            [CR, CR]|   [143, 143]|
+--------------------+--------------+------------------+--------------+--------------------+-------------+

And I expect the following result:

+--------------------+--------------+------------------+--------------+--------------------+-------------+
|            UserName|        explod|           explod2|       explod3|             explod4|      explod5|
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo                 |          Main|           1777203|           GBP|                  CR|          143|
|foo                 |    Individual|           1777203|           GBP|                  CR|          143|
+--------------------+--------------+------------------+--------------+--------------------+-------------+

I have tried a LATERAL VIEW:

sqlContext.sql("SELECT `UserName`, explod, explod2, explod3, explod4, explod5 FROM sourceDF
LATERAL VIEW explode(`col1`) sourceDF AS explod 
LATERAL VIEW explode(`col2`) explod AS explod2 
LATERAL VIEW explode(`col3`) explod2 AS explod3 
LATERAL VIEW explode(`col4`) explod3 AS explod4 
LATERAL VIEW explode(`col5`) explod4 AS explod5")

But I get a Cartesian product with a lot of duplicates: each LATERAL VIEW explodes against every row produced by the previous one, so the five two-element arrays multiply into 2^5 = 32 rows instead of 2. I have tried the same thing, exploding all the columns with a withColumn approach, but I still get a lot of duplicates:

.withColumn("col1", explode($"col1"))...

Of course I can apply distinct to the final DataFrame, but that's not an elegant solution. Is there any way to explode the columns without getting all these duplicates?
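A minimal sketch of that distinct workaround (assuming the DataFrame variable is df, and sqlContext.implicits._ is in scope for the $ syntax on 1.6):

import org.apache.spark.sql.functions.explode

// Explode each array column independently (this multiplies rows),
// then collapse the 32 duplicated combinations back to 2 with distinct.
val exploded = df
  .withColumn("col1", explode($"col1"))
  .withColumn("col2", explode($"col2"))
  .withColumn("col3", explode($"col3"))
  .withColumn("col4", explode($"col4"))
  .withColumn("col5", explode($"col5"))
  .distinct()

Besides the extra shuffle, distinct would also collapse rows that legitimately repeat across array positions, which is why it feels wrong.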

Thanks!

AJDF
Possible duplicate of [Explode (transpose?) multiple columns in Spark SQL table](https://stackoverflow.com/questions/33220916/explode-transpose-multiple-columns-in-spark-sql-table) – user10938362 May 21 '19 at 17:39
Hi, that question is for Spark 2.x and later, and we are using Spark 1.6, so most of the solutions provided there won't work. – AJDF May 22 '19 at 09:19

1 Answer


If you are using Spark 2.4.0 or later, arrays_zip makes the task easier:

import org.apache.spark.sql.functions.{arrays_zip, explode}
import spark.implicits._ // for $ and toDF (provided automatically in spark-shell)

val df = Seq(
  ("foo",
   Seq("Main", "Individual"),
   Seq(1777203, 1777203),
   Seq("GBP", "GBP"),
   Seq("CR", "CR"),
   Seq(143, 143)))
  .toDF("UserName", "col1", "col2", "col3", "col4", "col5")

// arrays_zip pairs the arrays element-by-element into one array of structs,
// so a single explode emits one row per index instead of a Cartesian product.
df.select($"UserName",
          explode(arrays_zip($"col1", $"col2", $"col3", $"col4", $"col5")))
  .select($"UserName", $"col.*") // "col" is explode's default output column name
  .show()

Output:

+--------+----------+-------+----+----+----+
|UserName|      col1|   col2|col3|col4|col5|
+--------+----------+-------+----+----+----+
|     foo|      Main|1777203| GBP|  CR| 143|
|     foo|Individual|1777203| GBP|  CR| 143|
+--------+----------+-------+----+----+----+
ollik1
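Since the question targets Spark 1.6, where arrays_zip is not available, the same zip-then-explode idea can be sketched with a UDF instead. This is only a sketch, not the answer's code: zipArrays is a made-up name, and it assumes every array in a row has the same length (import sqlContext.implicits._ provides the $ syntax on 1.6):

import org.apache.spark.sql.functions.{explode, udf}

// Zip the five arrays element-by-element into one array of tuples
// (structs with fields _1.._5), so a single explode is enough.
val zipArrays = udf((a: Seq[String], b: Seq[Int], c: Seq[String], d: Seq[String], e: Seq[Int]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i), e(i))))

df.withColumn("zipped",
    explode(zipArrays($"col1", $"col2", $"col3", $"col4", $"col5")))
  .select($"UserName",
          $"zipped._1".as("col1"),
          $"zipped._2".as("col2"),
          $"zipped._3".as("col3"),
          $"zipped._4".as("col4"),
          $"zipped._5".as("col5"))
  .show()

Because the arrays are zipped by index before the single explode, no Cartesian product arises and no distinct is needed.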