
I have a DataFrame in Spark:

id   |   itemid   |   itemquant   |  itemprice 
-------------------------------------------------
 A   |    1,2,3   |   2,2,1       |  30,19,10
 B   |    3,5     |   5,8         |  18,40

Here all the columns are of string datatype.

How can I use the explode function across multiple columns to create the new dataframe shown below?

id   |   itemid   |   itemquant   |  itemprice 
-------------------------------------------------
 A   |     1      |      2        |     30
 A   |     2      |      2        |     19
 A   |     3      |      1        |     10
 B   |     3      |      5        |     18
 B   |     5      |      8        |     40

In the new dataframe too, all the columns are of string datatype.

Shaido
Dipanjan Das
1 Answer


You need a UDF for that:

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes an active SparkSession named `spark`

val df = Seq(
  ("A","1,2,3","2,2,1","30,19,10"),
  ("B","3,5","5,8","18,40")
).toDF("id","itemid","itemquant","itemprice")

// Split each CSV string and zip the pieces element-wise into triples
val splitAndZip = udf((col1: String, col2: String, col3: String) => {
  col1.split(',').zip(col2.split(',')).zip(col3.split(',')).map { case ((a, b), c) => (a, b, c) }
})
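To see what the UDF body does on plain Scala strings, independent of Spark (note that `zip` silently truncates to the shorter side if the lists differ in length, so malformed rows lose elements rather than fail):

```scala
val a = "1,2,3".split(',')
val b = "2,2,1".split(',')
val c = "30,19,10".split(',')

// Nested zip produces ((a,b),c); the map flattens it to (a,b,c)
val zipped = a.zip(b).zip(c).map { case ((x, y), z) => (x, y, z) }
// zipped: Array((1,2,30), (2,2,19), (3,1,10))
```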

df
  .withColumn("tmp", explode(splitAndZip($"itemid", $"itemquant", $"itemprice")))
  .select(
    $"id",
    $"tmp._1".as("itemid"),
    $"tmp._2".as("itemquant"),
    $"tmp._3".as("itemprice")
  )
  .show()

+---+------+---------+---------+
| id|itemid|itemquant|itemprice|
+---+------+---------+---------+
|  A|     1|        2|       30|
|  A|     2|        2|       19|
|  A|     3|        1|       10|
|  B|     3|        5|       18|
|  B|     5|        8|       40|
+---+------+---------+---------+
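As a side note: on Spark 2.4+ the same result can be obtained without a UDF, using the built-in `split` and `arrays_zip` functions. A minimal sketch, assuming a local SparkSession (the intermediate column names `ids`, `quants`, `prices` are arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("explode-zip").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", "1,2,3", "2,2,1", "30,19,10"),
  ("B", "3,5", "5,8", "18,40")
).toDF("id", "itemid", "itemquant", "itemprice")

// Split each CSV string into an array, zip the arrays element-wise into
// an array of structs, then explode into one row per struct.
val exploded = df
  .withColumn("ids",    split($"itemid", ","))
  .withColumn("quants", split($"itemquant", ","))
  .withColumn("prices", split($"itemprice", ","))
  .withColumn("tmp", explode(arrays_zip($"ids", $"quants", $"prices")))
  .select(
    $"id",
    $"tmp.ids".as("itemid"),
    $"tmp.quants".as("itemquant"),
    $"tmp.prices".as("itemprice")
  )

exploded.show()
```

This keeps everything in Catalyst-optimizable built-ins, which generally performs better than a UDF.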
Raphael Roth