
How can I normalize a spark data frame which mainly consists of nested arrays?

case class FooBar(id:String, foo:Seq[String], bar:String, baz: Seq[String])
val f = Seq(FooBar("thinga", Seq("1 "), "1 2 3 ", Seq("2 ")),
    FooBar("thinga", Seq("1 2 3 4 "), " 0 0 0 ", Seq("2 3 4 5 ")),
    FooBar("thingb", Seq("1 2 "), "1 2 3 4 5 ", Seq("1 2 ")),
    FooBar("thingb", Seq("0 ", "0 ", "0 "), "1 2 3 4 5 ", Seq("1 2 3 "))).toDS
f.printSchema
f.show(false)
+------+------------+----------+----------+
|    id|         foo|       bar|       baz|
+------+------------+----------+----------+
|thinga|        [1 ]|    1 2 3 |      [2 ]|
|thinga|  [1 2 3 4 ]|    0 0 0 |[2 3 4 5 ]|
|thingb|      [1 2 ]|1 2 3 4 5 |    [1 2 ]|
|thingb|[0 , 0 , 0 ]|1 2 3 4 5 |  [1 2 3 ]|
+------+------------+----------+----------+


scala> f.printSchema
root
 |-- id: string (nullable = true)
 |-- foo: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- bar: string (nullable = true)
 |-- baz: array (nullable = true)
 |    |-- element: string (containsNull = true)

I want to have something like explode which will keep the schema of (id, foo, bar, baz) but return an individual record for each value of the array. The final result should no longer contain arrays.

Foo and baz are related. Their ordering must not be distorted. They always have the same length and the first value of foo is related to the first value of baz - and so on. Maybe I should combine them first into a single column / struct?
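Something along the lines of the "combine first" idea is what I have in mind (an untested sketch; `arrays_zip` requires Spark 2.4+, and it ignores that the array elements themselves still contain embedded spaces):

```scala
import org.apache.spark.sql.functions._

// Sketch: pair foo and baz positionally into structs, then a single explode
// keeps the related values together. arrays_zip requires Spark 2.4+.
f.withColumn("fb", explode(arrays_zip($"foo", $"baz")))
  .select($"id", $"fb.foo".as("foo"), $"bar", $"fb.baz".as("baz"))
```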

The final result should look something like:

+------+---+---+---+
|    id|foo|bar|baz|
+------+---+---+---+
|thinga|  1|  1|  2|
|thinga|  1|  2|  2|
|thinga|  1|  3|  2|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|

....

|thingb|  0|  1|  1|
|thingb|  0|  2|  2|
|thingb|  0|  3|  3|
|thingb|  0|  4|  1|
|thingb|  0|  5|  2|
|thingb|  0|  1|  3|
|thingb|  0|  2|  1|
|thingb|  0|  3|  2|
|thingb|  0|  4|  3|
|thingb|  0|  5|  1|
|thingb|  0|  1|  2|
|thingb|  0|  2|  3|
|thingb|  0|  3|  1|
|thingb|  0|  4|  2|
|thingb|  0|  5|  3|
+------+---+---+---+

edit

A partially related question: Explode (transpose?) multiple columns in Spark SQL table

Marmite Bomber
Georg Heiler
  • Have you looked into `posexplode()`? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@posexplode(e:org.apache.spark.sql.Column):org.apache.spark.sql.Column – koiralo Mar 31 '18 at 09:16
  • I played with it - but could not yet achieve the desired result. – Georg Heiler Mar 31 '18 at 09:59
  • Your data is strange: your arrays only contain one element (space-separated); also `bar` is not an array in your example, but you still want to "explode" it? – Raphael Roth Mar 31 '18 at 19:14
  • Hello @GeorgHeiler, would it be a problem if your data were cleaned up like this: `val f = Seq(FooBar("thinga", Seq("1"), "1 2 3", Seq("2")), FooBar("thinga", Seq("1", "2", "3", "4"), "0 0 0", Seq("2", "3", "4", "5")), FooBar("thingb", Seq("1", "2"), "1 2 3 4 5", Seq("1", "2")), FooBar("thingb", Seq("0", "0", "0"), "1 2 3 4 5", Seq("1", "2", "3"))).toDS()` – abiratsis Mar 31 '18 at 19:17
  • This will remove the space from seq and will have multiple items instead of one inside the seq – abiratsis Mar 31 '18 at 19:26
  • Sometimes, as in `foo 0,0,0 `, the arrays contain multiple elements; sometimes the individual elements are only space separated. – Georg Heiler Mar 31 '18 at 21:23
  • I see @GeorgHeiler, but I believe you will need to clean your data first and solve the other issues later on. It is not possible to have a Seq/Array containing only a single space-separated element; that representation doesn't make sense. You first need to choose between Seq (multiple items) or String (one item separated by comma or space). That is why my solution below assumes that foo and baz are Seq while id and bar are String. – abiratsis Apr 01 '18 at 08:35
  • @AlexandrosBiratsis well the raw data stems from some weird XML files: https://gist.github.com/geoHeil/993298b60a6b29083e5fe1e9bce635bc I wanted to clean / read them into spark via spark-xml and some of the`measInfoId ` blocks have some irregular structure. I hoped to get by with spark built ins - but seems like I need to code a big parsing UDF. – Georg Heiler Apr 01 '18 at 08:42
  • Right and maybe it would be better to make a dataset based on the original data e.g all columns only Strings. This way might be easier to find out the exact steps needed also for the cleaning part – abiratsis Apr 01 '18 at 08:45
  • So the case right now is that you have inside foo Seq containing only one string element. This element can be separated either by comma or space is that correct? Does this apply for other cols also? – abiratsis Apr 01 '18 at 08:51
  • Sadly yes. But separation is always only via space. The column already denotes a list of untrimmed strings. Foo sometimes contains a single element, sometimes a list as outlined in the last case. – Georg Heiler Apr 01 '18 at 08:58
  • So after checking the XML @George I think is better initially to represent your data as string and after to cast it into Seq/Array. So initially all columns should be string – abiratsis Apr 01 '18 at 08:59
  • But if you agree that coding a large parsing udf with all the columns as input / map operation is most elegant I will do that. Just hoped that a spark-sql built in function would work as these usually make more sense to the query planner. – Georg Heiler Apr 01 '18 at 09:01
  • I don't believe that Spark has an automated way to choose how to handle comma or space separated data thus you should always do that by yourself. – abiratsis Apr 01 '18 at 09:07
  • So your class now should look like this @Georg `case class FooBar(id:String, foo:String, bar:String, baz: String)` ? – abiratsis Apr 01 '18 at 09:08
  • and init data like this: `val f = Seq(FooBar("thinga", "1", "1 2 3", "2"), FooBar("thinga", "1 2 3 4", "0 0 0", "2 3 4 5"), FooBar("thingb", "1 2", "1 2 3 4 5", "1 2"), FooBar("thingb", "0, 0, 0", "1 2 3 4 5", "1, 2, 3")).toDS()` – abiratsis Apr 01 '18 at 09:09
  • No you get me wrong. The commata are not inside the string I.e. As seen in the code. Only space is the single separator. – Georg Heiler Apr 01 '18 at 09:13
  • OK clear all columns strings separated by space. Correct this time? :) – abiratsis Apr 01 '18 at 09:15
  • Right. Correct now. – Georg Heiler Apr 01 '18 at 09:17
  • OK @Georg I updated my post respectively. I added 2 extra steps to convert string cols foo and baz into array. Hopefully it will work for you this time – abiratsis Apr 01 '18 at 09:34
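The cleaning step discussed in the comments above could be sketched with SQL higher-order functions (an untested sketch, assuming Spark 2.4+; each original array element may itself hold a space-separated string like `"1 2 3 4 "`):

```scala
import org.apache.spark.sql.functions.expr

// Split and trim every element of the foo/baz arrays, then flatten the
// resulting array of arrays back into a single array of clean tokens.
// transform/flatten are Spark 2.4+ SQL higher-order functions.
val cleaned = f
  .withColumn("foo", expr("flatten(transform(foo, x -> split(trim(x), ' ')))"))
  .withColumn("baz", expr("flatten(transform(baz, x -> split(trim(x), ' ')))"))
```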

1 Answer


According to our discussion (please check comments under the initial post) the following data should be valid:

+------+-------+---------+-------+
|    id|    foo|      bar|    baz|
+------+-------+---------+-------+
|thinga|      1|    1 2 3|      2|
|thinga|1 2 3 4|    0 0 0|2 3 4 5|
|thingb|    1 2|1 2 3 4 5|    1 2|
|thingb|  0 0 0|1 2 3 4 5|  1 2 3|
+------+-------+---------+-------+

The initialization code is then:

case class FooBar(id:String, foo:String, bar:String, baz: String)
val f = Seq(FooBar("thinga", "1", "1 2 3", "2"),
  FooBar("thinga", "1 2 3 4", "0 0 0", "2 3 4 5"),
  FooBar("thingb", "1 2", "1 2 3 4 5", "1 2"),
  FooBar("thingb", "0 0 0", "1 2 3 4 5", "1 2 3")).toDS()

If that is the case, the following code should produce the desired result:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{explode, split}

f
  .withColumn("foo", split($"foo", " "))
  .withColumn("baz", split($"baz", " "))
  .withColumn("bar", explode(split($"bar", " ")))
  .map { case Row(id: String, foo: Seq[String], bar: String, baz: Seq[String]) =>
    // zip foo and baz positionally so the related values stay paired
    (id, bar, foo.zip(baz))
  }
  .toDF("id", "bar", "foo+baz")
  .withColumn("foo+baz", explode($"foo+baz"))
  .withColumn("foo", $"foo+baz._1")
  .withColumn("baz", $"foo+baz._2")
  .drop("foo+baz")
  .select("id", "foo", "bar", "baz")
  .show(100)

The first two conversions split the space-separated columns foo and baz into arrays. Since column bar is a plain string, we convert it into an array with split and then explode it. The map returns a tuple (id, bar, c), where c is a sequence of (foo, baz) pairs. After the map we get the following output:

+------+---+--------------------+
|    id|bar|             foo+baz|
+------+---+--------------------+
|thinga|  1|             [[1,2]]|
|thinga|  2|             [[1,2]]|
|thinga|  3|             [[1,2]]|
|thinga|  0|[[1,2], [2,3], [3...|
|thinga|  0|[[1,2], [2,3], [3...|
|thinga|  0|[[1,2], [2,3], [3...|
|thingb|  1|      [[1,1], [2,2]]|
|thingb|  2|      [[1,1], [2,2]]|
|thingb|  3|      [[1,1], [2,2]]|
|thingb|  4|      [[1,1], [2,2]]|
|thingb|  5|      [[1,1], [2,2]]|
|thingb|  1|[[0,1], [0,2], [0...|
|thingb|  2|[[0,1], [0,2], [0...|
|thingb|  3|[[0,1], [0,2], [0...|
|thingb|  4|[[0,1], [0,2], [0...|
|thingb|  5|[[0,1], [0,2], [0...|
+------+---+--------------------+

Next we do one more explode, this time on "foo+baz", to extract the final pairs. At this point the output looks as below:

+------+---+-------+
|    id|bar|foo+baz|
+------+---+-------+
|thinga|  1|  [1,2]|
|thinga|  2|  [1,2]|
|thinga|  3|  [1,2]|
|thinga|  0|  [1,2]|
|thinga|  0|  [2,3]|
|thinga|  0|  [3,4]|
|thinga|  0|  [4,5]|
|thinga|  0|  [1,2]|
.....
|thingb|  1|  [0,2]|
|thingb|  1|  [0,3]|
|thingb|  2|  [0,1]|
|thingb|  2|  [0,2]|
|thingb|  2|  [0,3]|
|thingb|  3|  [0,1]|
|thingb|  3|  [0,2]|
|thingb|  3|  [0,3]|
|thingb|  4|  [0,1]|
|thingb|  4|  [0,2]|
|thingb|  4|  [0,3]|
|thingb|  5|  [0,1]|
|thingb|  5|  [0,2]|
|thingb|  5|  [0,3]|
+------+---+-------+ 

Finally, we fill the foo and baz columns from foo+baz._1 and foo+baz._2 respectively. This is the final output:

+------+---+---+---+
|    id|foo|bar|baz|
+------+---+---+---+
|thinga|  1|  1|  2|
|thinga|  1|  2|  2|
|thinga|  1|  3|  2|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|
|thinga|  1|  0|  2|
|thinga|  2|  0|  3|
|thinga|  3|  0|  4|
|thinga|  4|  0|  5|
|thingb|  1|  1|  1|
|thingb|  2|  1|  2|
|thingb|  1|  2|  1|
|thingb|  2|  2|  2|
|thingb|  1|  3|  1|
|thingb|  2|  3|  2|
|thingb|  1|  4|  1|
|thingb|  2|  4|  2|
|thingb|  1|  5|  1|
|thingb|  2|  5|  2|
|thingb|  0|  1|  1|
|thingb|  0|  1|  2|
|thingb|  0|  1|  3|
|thingb|  0|  2|  1|
|thingb|  0|  2|  2|
|thingb|  0|  2|  3|
|thingb|  0|  3|  1|
|thingb|  0|  3|  2|
|thingb|  0|  3|  3|
|thingb|  0|  4|  1|
|thingb|  0|  4|  2|
|thingb|  0|  4|  3|
|thingb|  0|  5|  1|
|thingb|  0|  5|  2|
|thingb|  0|  5|  3|
+------+---+---+---+  
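Since the question asked about Spark built-ins: the typed map can also be avoided entirely (a sketch, assuming Spark 2.4+ for `arrays_zip`; on older versions the map-based version above would be needed):

```scala
import org.apache.spark.sql.functions.{arrays_zip, explode, split}

// Zip the split foo/baz arrays positionally into an array of structs,
// then explode the pairs; bar is exploded independently as before.
// arrays_zip names the struct fields after the input columns ("foo", "baz").
f.withColumn("foo", split($"foo", " "))
  .withColumn("baz", split($"baz", " "))
  .withColumn("bar", explode(split($"bar", " ")))
  .withColumn("fb", explode(arrays_zip($"foo", $"baz")))
  .select($"id", $"fb.foo".as("foo"), $"bar", $"fb.baz".as("baz"))
  .show(100)
```

Staying with built-in functions keeps the whole pipeline visible to the Catalyst optimizer, which a typed map does not.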
abiratsis