I am aggregating a table in AWS Glue (spark sql) to get the min
of created_on
and the sum
of total
. However I want to group by id
, id2
, id3
and to get the corresponding engment_partn_text
value for the earliest created_on date. So that the first 3 rows would only show Phill but total still right.
I am running this query within aws glue.
I worked around this dividing this query and joining them back together but I would like to know if there is a more straightforward way.
Current outcome
id | id2 | id3 | created_on | engmt_prtnr_text | total |
---|---|---|---|---|---|
9045692e4557 | 6206435 | XRS00664 | 9/12/2022 | Rey | 15 |
9045692e4557 | 6206435 | XRS00664 | 5/10/2020 | Phil | 10 |
9045692e4557 | 6206435 | XRS00664 | 9/12/2022 | Jenny | 5 |
4fefkb67z39f | 6666269 | zGR00364 | 4/20/2022 | Dan | 3 |
4fefkb67z39f | 6666269 | zGR00364 | 7/26/2026 | HYghes PetYnXz | 5 |
b765430374zk | 6600464 | NEB00023 | 2/2/2021 | William | 17 |
b765430374zk | 6600464 | NEB00023 | 2/2/2022 | Jae | 25 |
d376zze4be63 | 6262633 | YSD07507 | 2/2/2022 | Joseph | 0 |
d376zze4be63 | 6262633 | YSD07507 | 2/2/2021 | Doe | 20 |
Expected outcome
id | id2 | id3 | created_on | engmt_prtnr_text | total |
---|---|---|---|---|---|
9045692e4557 | 6206435 | XRS00664 | 9/12/2022 | Phil | 30 |
4fefkb67z39f | 6666269 | zGR00364 | 4/20/2022 | Dan | 8 |
b765430374zk | 6600464 | NEB00023 | 2/2/2021 | William | 32 |
d376zze4be63 | 6262633 | YSD07507 | 2/2/2022 | Doe | 20 |
The query I managed to do was
WITH
table2 AS (
SELECT
id
,id2
,id3
,min(created_on) AS created_on
,wbs.engmt_prtnr_text
,ROW_NUMBER() OVER(PARTITION BY id, id2,id3 ORDER BY MIN(created_on) DESC) AS position
FROM
source
GROUP BY
id
,id2
,id3
,engmt_prtnr_text
),
table1 AS (
SELECT
id
,id2
,id3
,MIN(created_on) AS created_on
,SUM(total) AS total
FROM
source
GROUP BY
id
,id2
,id3
)
SELECT t1.id
,t1.id2
,t1.id3
,t1.created_on
,t2.engmt_prtnr_text
,t1.total
FROM
tabla1 AS t1
LEFT JOIN
tabl2 AS t2
ON
t1.id= t2.id
AND t1.id2 = t2.id2
AND t1.id3= t2.id3
AND t2.position = 1