How can I normalize a spark data frame which mainly consists of nested arrays?
case class FooBar(id:String, foo:Seq[String], bar:String, baz: Seq[String])
val f = Seq(FooBar("thinga", Seq("1 "), "1 2 3 ", Seq("2 ")),
FooBar("thinga", Seq("1 2 3 4 "), " 0 0 0 ", Seq("2 3 4 5 ")),
FooBar("thingb", Seq("1 2 "), "1 2 3 4 5 ", Seq("1 2 ")),
FooBar("thingb", Seq("0 ", "0 ", "0 "), "1 2 3 4 5 ", Seq("1 2 3 "))).toDS
f.printSchema
f.show(false)
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| [1 ]| 1 2 3 | [2 ]|
|thinga| [1 2 3 4 ]| 0 0 0 |[2 3 4 5 ]|
|thingb| [1 2 ]|1 2 3 4 5 | [1 2 ]|
|thingb|[0 , 0 , 0 ]|1 2 3 4 5 | [1 2 3 ]|
+------+------------+----------+----------+
scala> f.printSchema
root
|-- id: string (nullable = true)
|-- foo: array (nullable = true)
| |-- element: string (containsNull = true)
|-- bar: string (nullable = true)
|-- baz: array (nullable = true)
| |-- element: string (containsNull = true)
I want to have something like explode which will keep the schema of (id, foo, bar, baz) but return an individual record for each value of the array. The final result should no longer contain arrays.
Foo and baz are related. Their ordering must not be distorted. They always have the same length and the first value of foo is related to the first value of baz - and so on. Maybe I should combine them first into a single column / struct?
The final result should look something like:
+------+------------+----------+----------+
| id| foo| bar| baz|
+------+------------+----------+----------+
|thinga| 1 | 1 | 2 |
|thinga| 1 | 2 | 2 |
|thinga| 1 | 3 | 2 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
|thinga| 1 | 0 |2 |
|thinga| 2 | 0 |3 |
|thinga| 3 | 0 |4 |
|thinga| 4 | 0 |5 |
....
|thingb|0 |1 | 1 |
|thingb|0 |2 | 2 |
|thingb|0 |3 | 3 |
|thingb|0 |4 | 1 |
|thingb|0 |5 | 2 |
|thingb|0 |1 | 3 |
|thingb|0 |2 | 1 |
|thingb|0 |3 | 2 |
|thingb|0 |4 | 3 |
|thingb|0 |5 | 1 |
|thingb|0 |1 | 2 |
|thingb|0 |2 | 3 |
|thingb|0 |3 | 1 |
|thingb|0 |4 | 2 |
|thingb|0 |5 | 3 |
+------+------------+----------+----------+
edit
partially related questions - Explode (transpose?) multiple columns in Spark SQL table