1

I have a table having columns id,create_time,code. create_time column is of type string having timestamp value in the format yyyy-MM-dd HH:mm:ss.SSSSSS Now my requirement is to find the latest code(recent create_time) for each id. If the create_time column has no milliseconds part, I can do

select id,create_time,code from(
select id,max(unix_timestamp(create_time,"yyyy-MM-dd HH:mm:ss")) over (partition by id) as latest_time from table)a
join table b on a.latest_time=b.create_time

As unix time functions consider only seconds not milliseconds, am not able to proceed with them.

Please help

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Vanaja Jayaraman
  • 753
  • 3
  • 18

2 Answers2

0

Why would you try to convert at all? Since you are only looking for the latest timestamp I would just do:

select id,create_time,code from(
select id,max(create_time) over (partition by id) as latest_time from table)a
join table b on a.latest_time=b.create_time

The ones without miliseconds will be treated, as they would have "000000" instead.

Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
  • create_time column is of string type. Are you sure max(create_time) time will give me the latest timestamp? – Vanaja Jayaraman Aug 21 '19 at 11:25
  • Yes, hive supports max for strings. Max ~ first record in descending order in this sense. – Grzegorz Skibinski Aug 21 '19 at 11:28
  • When am doing the same thing using group by and join instead of using max over partition am getting different results. The query I used is select id,create_time,code from( select id,max(create_time) as latest_time from table group by id)a join table b on a.latest_time=b.create_time – Vanaja Jayaraman Aug 21 '19 at 11:55
  • I suppose you get less rows, because you aggregate by id, while with windowing function - you probably duplicate a lot of rows. So in the first example your nested query, with windowing function will produce id, max(create_time), as many times, as many rows for the same id you have, which in turn will produce duplication in the general query (but maybe this is what you want to obtain). So the second query would be intuitively better (i.e. example without windowing function) – Grzegorz Skibinski Aug 21 '19 at 12:09
  • In the query you mentioned I added join condition on id also. Else if multiple id's have same create_time, it fails. Everything else is fine. Thanks a lot!!! – Vanaja Jayaraman Aug 22 '19 at 06:33
0

You do not need join for this.

If you need all records with max(create_time), use rank() or dense_rank(). Rank will assign 1 to all records with the latest create_time if there are many records with the same time.

If you need only one record per id even it there are many records with create_time=max(create_time), then use row_number() instead of rank():

select id,create_time,code 
from
(
select id,create_time,code,
       rank() over(partition by id order by create_time desc) rn
)s
where rn=1;
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • My issue was with milliseconds part of create_time column. – Vanaja Jayaraman Aug 22 '19 at 06:34
  • @VanajaJayaraman I see. I just provided more optimal way of gettimg records with max(create_time). This will work fine with milliseconds because your timestamps are in the sortable format, this means you can use them directly without conversion. – leftjoin Aug 22 '19 at 06:40