0

I am aggregating a table in AWS Glue (spark sql) to get the min of created_on and the sum of total. However I want to group by id, id2, id3 and to get the corresponding engment_partn_text value for the earliest created_on date. So that the first 3 rows would only show Phill but total still right.

I am running this query within aws glue.

I worked around this dividing this query and joining them back together but I would like to know if there is a more straightforward way.

Current outcome

id id2 id3 created_on engmt_prtnr_text total
9045692e4557 6206435 XRS00664 9/12/2022 Rey 15
9045692e4557 6206435 XRS00664 5/10/2020 Phil 10
9045692e4557 6206435 XRS00664 9/12/2022 Jenny 5
4fefkb67z39f 6666269 zGR00364 4/20/2022 Dan 3
4fefkb67z39f 6666269 zGR00364 7/26/2026 HYghes PetYnXz 5
b765430374zk 6600464 NEB00023 2/2/2021 William 17
b765430374zk 6600464 NEB00023 2/2/2022 Jae 25
d376zze4be63 6262633 YSD07507 2/2/2022 Joseph 0
d376zze4be63 6262633 YSD07507 2/2/2021 Doe 20

Expected outcome

id id2 id3 created_on engmt_prtnr_text total
9045692e4557 6206435 XRS00664 9/12/2022 Phil 30
4fefkb67z39f 6666269 zGR00364 4/20/2022 Dan 8
b765430374zk 6600464 NEB00023 2/2/2021 William 32
d376zze4be63 6262633 YSD07507 2/2/2022 Doe 20

The query I managed to do was

WITH 
table2 AS (
    SELECT
        id
        ,id2
        ,id3
        ,min(created_on) AS created_on
        ,wbs.engmt_prtnr_text
        ,ROW_NUMBER() OVER(PARTITION BY id, id2,id3 ORDER BY MIN(created_on) DESC) AS position
    FROM
         source
    GROUP BY
        id
        ,id2
        ,id3
        ,engmt_prtnr_text
        ),
table1 AS (
    SELECT
        id
        ,id2
        ,id3
        ,MIN(created_on) AS created_on
        ,SUM(total) AS total
    FROM
         source
    GROUP BY
        id
        ,id2
        ,id3
)      
    
SELECT t1.id
       ,t1.id2
       ,t1.id3
       ,t1.created_on
       ,t2.engmt_prtnr_text
       ,t1.total
FROM
    tabla1 AS t1
LEFT JOIN
    tabl2 AS t2
    ON
        t1.id= t2.id
        AND t1.id2 = t2.id2
        AND t1.id3= t2.id3
        AND t2.position = 1
  • Please update your question with the SQL you’ve managed to write. You should also always provide all information as editable text, rather than images, so that someone can copy it if they want to try a solution. It would also be helpful if you provided the source data that results in the output you are showing – NickW Mar 27 '23 at 20:30

0 Answers0