
I have a question similar to this one, but the columns that collect_list should operate on are given by a list of names. For example:

scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
|  A|   D1|  T0|   P1|
|  A|   D0|  T1|   P2|
|  B|   Y1|  T0|   P3|
|  B|   Y2|  T2|   P3|
|  C|   H1|  T0|   P5|
|  C|   H0|  T9|   P5|
|  B|   Y0|  T1|   P2|
|  B|   H1|  T3|   P6|
|  D|   H1|  T2|   P4|
+---+-----+----+-----+


scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)

scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]

scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
|  B|   [Y1, Y2, Y0, H1]|  [T0, T2, T1, T3]|   [P3, P3, P2, P6]|
|  D|               [H1]|              [T2]|               [P4]|
|  C|           [H1, H0]|          [T0, T9]|           [P5, P5]|
|  A|           [D1, D0]|          [T0, T1]|           [P1, P2]|
+---+-------------------+------------------+-------------------+

Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in combList in advance?

Jonathan
  • Hey Jonathan, did you figure this out? I'm facing a similar issue – outlier123 Jun 06 '18 at 09:47
  • I'm afraid I did not find a solution, but please note that this approach may not scale well for a large amount of data. – Jonathan Jun 07 '18 at 21:43
  • Thanks! I found a workaround using dicts, a for loop, and joins. It actually scales well for up to 2 billion rows and 30 columns. I'll keep you posted if I get around to publishing a library. – outlier123 Jun 25 '18 at 11:59
  • Maybe this could help - [aggregating-multiple-columns-with-custom-function-in-spark](https://stackoverflow.com/questions/37737843/aggregating-multiple-columns-with-custom-function-in-spark) – Maor Aharon Dec 30 '18 at 10:53
  • Possible duplicate of [Aggregating multiple columns with custom function in spark](https://stackoverflow.com/questions/37737843/aggregating-multiple-columns-with-custom-function-in-spark) – Maor Aharon Dec 30 '18 at 10:57
  • Please refer to this link for the solution; I've posted it there. – Sai Jan 25 '19 at 06:26

1 Answer

You can use collect_list(struct(col1, col2)) AS elements to gather several columns into a single array of structs.

Example:

// Register the DataFrame as a temp view so SQL can reference it as "teste".
df.createOrReplaceTempView("teste")

df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
// Collect (cd_item, nm_item) pairs into one array of structs per group.
val outputDf = spark.sql("SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema

df
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- cd_item: string (nullable = true)
 |-- nm_item: string (nullable = true)

outputDf
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cd_item: string (nullable = true)
 |    |    |-- nm_item: string (nullable = true)