Hive - Remove duplicates, keeping newest record - all of it

Question

There have been a few questions like this, with no answer, like this one here.

I thought I would post another in hopes of getting one.

I have a hive table with duplicate rows. Consider the following example:

*ID             Date           value1         value2*
1001            20160101       alpha          beta
1001            20160201       delta          gamma
1001            20160115       rho            omega
1002            20160101       able           charlie
1002            20160101       able           charlie

When complete, I only want two records. Specifically, these two:

*ID             Date           value1         value2*
1001            20160201       delta          gamma
1002            20160101       able           charlie

Why those two? For the ID=1001, I want the latest date and the data that is in that row with it. For the ID=1002, really the same answer, but the two records with that ID are complete duplicates, and I only want one.

So, any suggestions on how to do this? The simple "group by" using the ID and the 'max' date won't work, as that ignores the other columns. I cannot put 'max' on those, as it will pull the max columns from all the records (will pull 'rho' from an older record), which is not good.

I hope my explanation is clear, and I appreciate any insight.

Thank you

does [this](http://stackoverflow.com/a/38398563/4964651) help? — mtoto, Oct 04 '16 at 13:18
Thank mtoto. Works great!!! I guess I am going to have to get into Windowing. Post as a "answer" and you get a check. — Wanderer, Oct 05 '16 at 17:44

deusxmach1na · Accepted Answer · 2022-03-16T18:28:18.023

Try this:

WITH temp_cte AS (
        SELECT      mt.ID                                                           AS ID 
                    , mt.Date                                                       AS Date 
                    , mt.value1                                                     AS value1 
                    , mt.value2                                                     AS value2
                    , ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC)  AS row_num
        FROM        my_table mt
    )
    SELECT      tc.ID           AS ID 
                , tc.Date       AS Date 
                , tc.value1     AS value1
                , tc.value2     AS value2
    FROM        temp_cte tc 
    WHERE       tc.row_num = 1 
    ;

Or you can do MAX() and join the table to itself where ID = ID and max_date = Date. HTH.

Edit March 2022: Since ROW_NUMBER numbers every row and the user only cares about 1 row with the max date there's a better way to do this I discovered.

WITH temp_cte AS (
    SELECT      mt.ID                                                                               AS ID 
                , MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2))      AS my_struct
    FROM        my_table mt
    GROUP BY    mt.ID 
)
SELECT      tt.ID                       AS ID 
            , tt.my_struct.Date         AS Date
            , tt.my_struct.Value1       AS Value1
            , tt.my_struct.Value2       AS Value2
FROM        temp_cte tt 
;

this soln would not work if the sequence of ids are not consecutive in the input table. — Parijat Bose, Feb 27 '22 at 23:09
I'm not understanding. They wanted the max date for each ID and the ROW_NUMBER() partitions by ID. It will work for any ID set. However, since answering this question I found a better way to do this in Hive using NAMED_STRUCT() — deusxmach1na, Mar 16 '22 at 18:10
WITH temp_cte AS ( SELECT mt.ID AS ID , MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2)) AS my_struct FROM my_table mt GROUP BY mt.ID ) SELECT tt.ID AS ID , tt.my_struct.Date AS Date , tt.my_struct.Value1 AS Value1 , tt.my_struct.Value2 AS Value2 FROM temp_cte tt ; — deusxmach1na, Mar 16 '22 at 18:11
The only time it might not work is if ID is NULL. I'll let someone else test that though. — deusxmach1na, Mar 16 '22 at 18:13

Hive - Remove duplicates, keeping newest record - all of it

1 Answers1