I want to write a UDF over a DataFrame that compares the values of a particular row against the values from the same group, where the grouping is by multiple keys. Since UDFs operate on a single row, I want to write a query that returns the values from the same group as a new column value.
For example, given this input:
id | categoryAB | categoryXY | value1 | value2 |
---|---|---|---|---|
1 | A | X | 0.2 | True |
2 | A | X | 0.3 | False |
3 | A | X | 0.2 | True |
4 | B | X | 0.4 | True |
5 | B | X | 0.1 | True |
6 | B | Y | 0.5 | False |
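
For reference, here is a minimal sketch of how this input could be built (assuming a local Spark session; the names `spark` and `input` are just placeholders for this example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("group-values-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The example input as a DataFrame; column names match the table above.
val input = Seq(
  (1, "A", "X", 0.2, true),
  (2, "A", "X", 0.3, false),
  (3, "A", "X", 0.2, true),
  (4, "B", "X", 0.4, true),
  (5, "B", "X", 0.1, true),
  (6, "B", "Y", 0.5, false)
).toDF("id", "categoryAB", "categoryXY", "value1", "value2")
```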
I want to add two new columns:
- group1: an aggregation of the value1s from the same <categoryAB, categoryXY> group
- group2: an aggregation of the value2s from the same <categoryAB, categoryXY> group, i.e. the same grouping.
Expected result:
id | categoryAB | categoryXY | value1 | value2 | group1 | group2 |
---|---|---|---|---|---|---|
1 | A | X | 0.2 | True | [0.2, 0.3, 0.2] | [True, False, True] |
2 | A | X | 0.3 | False | [0.2, 0.3, 0.2] | [True, False, True] |
3 | A | X | 0.2 | True | [0.2, 0.3, 0.2] | [True, False, True] |
4 | B | X | 0.4 | True | [0.4, 0.1] | [True, True] |
5 | B | X | 0.1 | True | [0.4, 0.1] | [True, True] |
6 | B | Y | 0.5 | False | [0.5] | [False] |
To be clearer about the grouping, there are 3 groups in this example:
- <A,X> with rows 1, 2 and 3
- <B,X> with rows 4 and 5
- <B,Y> with row 6
I need to implement this in Scala with Spark SQL structures and functions, but a generic SQL answer would also be a helpful guide.
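
What I have in mind is something like the sketch below, assuming `collect_list` can be used as a window function over a partition by both keys (this is my assumption, and I'm not sure it's the idiomatic approach):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Partition by both grouping keys so each row sees its whole group.
val byCategory = Window.partitionBy("categoryAB", "categoryXY")

// Collect the group's value1/value2 into array columns on every row.
// Note: without an ORDER BY, the order of elements in the arrays is not guaranteed.
val withGroups = input
  .withColumn("group1", collect_list("value1").over(byCategory))
  .withColumn("group2", collect_list("value2").over(byCategory))
```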