hive sql find the latest record

Question

the table is:

create table test (
id string,
name string,
age string,
modified string)

data like this:

id    name   age  modifed
1     a      10   2011-11-11 11:11:11
1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-10 10:11:12
2     b      20   2012-12-12 10:11:12
2     b      20   2012-12-15 10:11:12

I want to get the latest record(include every colums id,name,age,modifed) group by id,as the data above,the correct result is:

1     a      11   2012-11-11 12:00:00
2     b      20   2012-12-15 10:11:12

I do like this:

insert overwrite table t 
select b.id, b.name, b.age, b.modified 
from (
        select id,max(modified) as modified 
        from test 
        group by id
) a 
left outer join test b on (a.id=b.id  and a.modified=b.modified);

This sql can get the right result,but when mass data，it runs slow.

**Is there any way to do this without left outer join? **

thanks for the question and the answer, they solved my problem completely! — eleforest, Sep 17 '14 at 09:13

score 53 · Answer 1 · answered Nov 30 '13 at 12:57

53

There's a nearly undocumented feature of Hive SQL (I found it in one of their Jira bug reports) that lets you do something like argmax() using struct()s. For example if you have a table like:

test_argmax
id,val,key
1,1,A
1,2,B
1,3,C
1,2,D
2,1,E
2,1,U
2,2,V
2,3,W
2,2,X
2,1,Y

You can do this:

select 
  max(struct(val, key, id)).col1 as max_val,
  max(struct(val, key, id)).col2 as max_key,
  max(struct(val, key, id)).col3 as max_id
from test_argmax
group by id

and get the result:

max_val,max_key,max_id
3,C,1
3,W,2

I think in case of ties on val (the first struct element) it will fall back to comparison on the second column. I also haven't figured out whether there's a neater syntax for getting the individual columns back out of the resulting struct, maybe using named_struct somehow?

answered Nov 30 '13 at 12:57

patricksurry

5,508
2
27
38

1

This is a great solution, I like it a lot! Thank you very much. – Clemens Valiente Oct 16 '14 at 12:02
could you explain the role of the col1, col2, col3 attributes? – Quetzalcoatl Aug 28 '18 at 01:18
`col1, ...` are just the default field names for the (unnamed) struct. It might be clearer to write the more verbose `max(named_struct('val', val, 'key', key, 'id', id)).id as max_id` – patricksurry Sep 13 '18 at 20:20

Rahul Sharma · Answer 2 · 2019-05-17T02:39:25.650

11

Just slightly different approach than what has been answered in previous answer.

Below example uses hive windowing function to find out the latest record, read more here

SELECT t.id
    ,t.name
    ,t.age
    ,t.modified
FROM (
    SELECT id
        ,name
        ,age
        ,modified
        ,ROW_NUMBER() OVER (
            PARTITION BY id ORDER BY unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') DESC
            ) AS ROW_NUMBER   
    FROM test
    ) t
WHERE t.ROW_NUMBER <= 1;

The modified is string so converting it to timestamp using unix_timestamp(modified,'yyyy-MM-dd hh:mm:ss') then applying order by on timestamp.

edited May 17 '19 at 02:39

answered Apr 20 '17 at 20:01

Rahul Sharma

5,614
10
57
91

1

I wonder, do we need the unix_timestamp modifier? If modified is a string, it would be unnecessary, right? – Diogo Santos Aug 07 '20 at 08:55

score 10 · Answer 3 · answered May 09 '15 at 15:24

There is a relatively recent feature of Hive SQL, analytic functions and the over clause. This should do the job without joins

select id, name, age, last_modified 
from ( select id, name, age, modified, 
              max( modified) over (partition by id) as last_modified 
       from test ) as sub
where   modified = last_modified

What's going on here is that the subquery produces a new row with an extra column last_modified which has the latest modified timestamp for the corresponding person's id. (Similar to what group by would do) The key here is that the subquery gets you again one row per row in your original table and then you filter from that.

There is a chance that even the simpler solution works:

select  id, name, age,  
        max( modified) over (partition by id) last_modified 
from test 
where   modified = last_modified

By the way, the same code would work in Impala, too.

how do you treat identical max(modified) ? – lydias Jan 15 '20 at 20:28 — lydias, Jan 15 '20 at 20:28

score 6 · Answer 4 · answered Nov 23 '12 at 04:29

6

Give this a try:

select t1.* from test t1
join (
  select id, max(modifed) maxModified from test
  group by id
) s
on t1.id = s.id and t1.modifed = s.maxModified

Fiddle here.

Left outer join solution here.

Let us know which one runs faster :)

answered Nov 23 '12 at 04:29

Mosty Mostacho

42,742
16
96
123

your sql,Time taken: 325.579 seconds Total MapReduce CPU Time Spent: 11 minutes 36 seconds 130 msec,6 jobs. my sql,Time taken: 220.736 seconds Total MapReduce CPU Time Spent: 12 minutes 13 seconds 80 msec,5 job. – qiulp Nov 23 '12 at 04:48
1

Note that your query is the same as the first one I posted (I've just realised) but yours uses an unnecessary left join. An inner join is enough. And what about the REAL left outer join solution I provided? That may probably take more. BTW, make sure you replace `t1.*` with only the necessary fields, of course. – Mosty Mostacho Nov 23 '12 at 05:03
"Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive." left join .. on .. and t1.modifed < t2.modifed,the sql is wrong in Hive. – qiulp Nov 23 '12 at 05:14

SRIRAM · Answer 5 · 2012-11-23T04:59:15.677

0

try this

select id,name,age,modified from test
 where modified=max(modified)
 group by id,name

edited Nov 23 '12 at 04:59

answered Nov 23 '12 at 04:50

SRIRAM

1,888
2
17
17

age can been changeed,so it can't "group by id,name,age",just like this:1 a 10 2011-11-11 11:11:11 1 a 11 2012-11-11 12:00:00 – qiulp Nov 23 '12 at 04:56

score 0 · Answer 6 · answered Nov 28 '12 at 15:52

0

If u can make sure that row which has max modified also has max age in same id row set.

Try

select id, name, max(age), max(modified) 
from test
group by id, name

answered Nov 28 '12 at 15:52

pensz

1,871
1
13
18

user 923227 · Answer 7 · 2013-03-14T21:52:45.263

Presume the data is like this:

    id      name    age     modifed
    1       a       10      2011-11-11 11:11:11
    1       a       11      2012-11-11 12:00:00
    2       b       23      2012-12-10 10:11:12
    2       b       21      2012-12-10 10:11:12
    2       b       22      2012-12-15 10:11:12
    2       b       20      2012-12-15 10:11:12

then the result of the above query will give you - (notice the repeated 2, b having the same date time)

    1       a       11      2012-11-11 12:00:00
    2       b       22      2012-12-15 10:11:12
    2       b       20      2012-12-15 10:11:12

This query runs an additional group by and is less efficient but gives the correct result -

    select collect_set(b.id)[0], collect_set(b.name)[0], collect_set(b.age)[0], b.modified
    from
        (select id, max(modified) as modified from test group by id) a
      left outer join
        test b
      on
        (a.id=b.id and a.modified=b.modified)
    group by
      b.modified;

then the result of the above query will give you

    1       a       11      2012-11-11 12:00:00
    2       b       20      2012-12-15 10:11:12

Now if we improve the query a little - then in place of 3 MRs it runs only one Keping the result same -

    select id, collect_set(name)[0], collect_set(age)[0], max(modified)
    from test 
    group by id;

Note - this will slow down in case your group by field produces large results.

score 0 · Answer 8 · answered Jul 23 '15 at 00:10

0

You can get the required result without using left outer join like this:

select * from test where (id, modified) in(select id, max(modified) from test group by id)

http://sqlfiddle.com/#!2/bfbd5/42

answered Jul 23 '15 at 00:10

aditya

1
1
1

hive sql find the latest record

8 Answers8

Linked