0

I have following data

df.show

+---------+--------------------+------------+
|     name| age|         tokens| tokensCount|
+---------+----+---------------+------------+
|    Alice|  29|        [A,B,C]|           3|
|      Bob|  28|      [A,B,C,D]|           4|
|  Charlie|  29|    [A,B,C,D,E]|           5|    
+---------+----+---------------+------------+

I transform data with following command

val newDF = df.select(($"name") +: (0 until 4).map(i => ($"tokens")(i).alias(s"token$i")): _*).show

+---------+-------+-------+-------+-------+
|     name| token0| token1| token2| token3| 
+---------+-------+-------+-------+-------+
|    Alice|      A|      B|      C|   null|
|      Bob|      A|      B|      C|      D|
|  Charlie|      A|      B|      C|      D|
+---------+-------+-------+-------+-------+

I want to give tokensCount instead of static value 4 at (0 until 4)

I tried a few things like $"tokensCount" and size($"tokens"), but could not get through.

Can anyone suggest how to loop or map according to the size of array or count of array ?

Many thanks

Faaiz
  • 635
  • 8
  • 18

1 Answers1

2

You can modify your code to find the maximum length of tokens, and then use that to create the necessary columns:

val df = Seq(
  ("Alice", 29, Array("A", "B", "C")),
  ("Bob", 28, Array("A", "B", "C", "D")),
  ("Charlie", 29, Array("A", "B", "C", "D", "E")),
).toDF("name", "age", "tokens")

val maxTokenCount = df.withColumn("token_count", size(col("tokens"))).agg(max("token_count") as "mtc")

val newDF = df.select(($"name") +: (0 until maxTokenCount.first.getInt(0)).map(i => ($"tokens")(i).alias(s"token$i")): _*).show

Which will give you:

+-------+------+------+------+------+------+
|   name|token0|token1|token2|token3|token4|
+-------+------+------+------+------+------+
|  Alice|     A|     B|     C|  null|  null|
|    Bob|     A|     B|     C|     D|  null|
|Charlie|     A|     B|     C|     D|     E|
+-------+------+------+------+------+------+

It might be useful to explain why you want to do this transformation, as there might be a much more efficient way. This has the potential to create a very sparse dataframe. Imagine that most names have no tokens, but Bob has 100 tokens: all of a sudden you have one hundred columns of mostly null values.

Jarrod Baker
  • 1,150
  • 8
  • 13
  • Thanks a lot. Your comment is genuine. One issue resolved that gave birth to new one :) I have `mtc = 147` and there are several tuples, where i have `null`, which is not a desired transformation for the input where this DataFrame have to go. I will think on it and write a comment to explain a bit more tomorrow. If possible, give a remark/comment. Many thanks – Faaiz Mar 17 '22 at 23:14
  • Infact, i want to build a binary table for a given set of tokens(e g., A, B,C,D,E). For a specific name, if there exists a token, it should get 1, else 0. For that i have made a schema with nested StructType object. – Faaiz Mar 18 '22 at 06:46