1

I want to write a UDF over a data frame that operates as comparing values of particular row against the values from same group, where the grouping is by multiple keys. As UDFs operate on a single row, I want to write a query that returns values from same group in as a new column value.

For example over this Input:

id categoryAB categoryXY value1 value2
1 A X 0.2 True
2 A X 0.3 False
3 A X 0.2 True
4 B X 0.4 True
5 B X 0.1 True
6 B Y 0.5 False

I can add

  • group1: aggregation of value1s from the same <categroyAB, categroyXY> group
  • group2: aggregation of value1s from the same <categroyAB, categroyXY> group i.e. same grouping.

Expected result:

id categoryAB categoryXY value1 value2 group1 group2
1 A X 0.2 True [0.2, 0.3, 0.2] [True, False, True]
2 A X 0.3 False [0.2, 0.3, 0.2] [True, False, True]
3 A X 0.2 True [0.2, 0.3, 0.2] [True, False, True]
4 B X 0.4 True [0.4, 0.1] [True, True]
5 B X 0.1 True [0.4, 0.1] [True, True]
6 B Y 0.5 False [0.5] [False]

To be more clear about grouping, there are 3 groups in this example

  • <A,X> with rows 1, 2 and 3
  • <B,X> with rows 4 and 5
  • <B,Y> with row 6

I need to implement it in Scala with Spark SQL structures and functions but a generic SQL answer could be guiding.

1 Answers1

1

There might be a more optimized method, but here how I usually do:

val df = Seq(
  (1, "A", "X", 0.2, true),
  (2, "A", "X", 0.3, false),
  (3, "A", "X", 0.2, true),
  (4, "B", "X", 0.4, true),
  (5, "B", "X", 0.1, true),
  (6, "B", "Y", 0.5, false)
).toDF("id", "categoryAB", "categoryXY", "value1", "value2")

df.join(
  df.groupBy("categoryAB", "categoryXY")
    .agg(
      collect_list('value1) as "group1",
      collect_list('value2) as "group2"
    ),
  Seq("categoryAB", "categoryXY")
).show()

The idea is that I compute separately the aggregation on categoryAB and categoryXY, and then I join the new dataframe to the original one (make sure that df is cached if it is the result of heavy computations as otherwise it will be computed twice).

leleogere
  • 908
  • 2
  • 15
  • 1
    This didn't directly solved my problem but the proposed idea guided a working solution. The collect_list aggregator is the key. I've also used a struct to merge both groups in one column and then accessed with getAs function in UDF! Thanks @leleogeer! – Bahadır İsmail Aydın Jun 08 '22 at 09:39
  • You can also post your solution as another answer, it can help other people searching for something similar :) – leleogere Jun 08 '22 at 09:49