0

I have data of following format:

+-----+---------------+
| name|           Data|
+-----+---------------+
|Alpha|      [A, B, C]|
| Beta|   [A, B, C, D]|
|Gamma|[A, B, C, D, E]|
+-----+---------------+

How to transform it into ?

+-----+----+-----+-----+-----+-----+
| name|   A|    B|    C|    D|    E|
+-----+----+-----+-----+-----+-----+
|Alpha|   1|    1|    1|    0|    0|
| Beta|   1|    1|    1|    1|    0|
|Gamma|   1|    1|    1|    1|    1|
+-----+----+-----+-----+-----+-----+

Thanks to @Jarrod Baker for help in similar transformation earlier

Here is the code that i have:

val df = Seq(
      ("Alpha", Array("A", "B", "C")),
      ("Beta", Array("A", "B", "C", "D")),
      ("Gamma", Array("A", "B", "C", "D", "E")),
).toDF("name", "Data")
df.show()

val arrayDataSize = df.withColumn("arr_size", size(col("Data"))).agg(max("arr_size") as "maxSize") 

val newDF = df.select(($"name") +: (0 until arrayDataSize.first.getInt(0)).map(i => {($"Data") (i).contains("A").alias("A") }): _*)

newDF.show()
+-----+----+-----+-----+-----+-----+
| name|   A|    A|    A|    A|    A|
+-----+----+-----+-----+-----+-----+
|Alpha|true|false|false| null| null|
| Beta|true|false|false|false| null|
|Gamma|true|false|false|false|false|
+-----+----+-----+-----+-----+-----+

Thanks in advance for your help.

Faaiz
  • 635
  • 8
  • 18
  • Take a look at [this question](https://stackoverflow.com/questions/64854310/transpose-in-spark-scala-logic) and [this question](https://stackoverflow.com/questions/62803305/scala-explode-method-cartesian-product-multiple-array) – tjheslin1 Apr 02 '22 at 12:19

1 Answers1

0

You can use the RelationalGroupedDataset's pivot method to achieve what you want. To create such a Dataset, you need to use groupBy on a Dataset.

It would look something like this:

import spark.implicits._

val df = Seq(
  ("Alpha", Seq("A", "B", "C")),
  ("Beta", Seq("A", "B", "C", "D")),
  ("Gamma", Seq("A", "B", "C", "D", "E"))
).toDF("name", "Data")

val output = df
  .select(df("name"), explode(col("Data")).alias("Data"))
  .groupBy("name")
  .pivot("Data")
  .count()

output.show()
+-----+---+---+---+----+----+
| name|  A|  B|  C|   D|   E|
+-----+---+---+---+----+----+
| Beta|  1|  1|  1|   1|null|
|Gamma|  1|  1|  1|   1|   1|
|Alpha|  1|  1|  1|null|null|
+-----+---+---+---+----+----+

As you can see, we're first explode-ing our Sequences into separate rows. This allows us to treat each element in each sequence as a separate "entity".

Then, we're using groupBy to get our RelationalGroupedDataset, after which we pivot and count the occurences.

Koedlt
  • 4,286
  • 8
  • 15
  • 33