13

How can I generate row numbers for an existing table while running a select query?
For example:

select row_number(), * from emp;

I am using hive 0.13. I can't access external jars or udfs in my environment. The underlying files are in parquet format.

Thanks in advance!

dur
  • 15,689
  • 25
  • 79
  • 125
Gayatri Mahesh
  • 327
  • 1
  • 3
  • 12
  • 2
    If you try to run this kind of analytic functions on large data sets (i.e. over 50 million rows) then be careful to test your data consistency. I have seen subtle **data corruption** occur in a **deterministic way** with V0.13 and V0.14 -- the row numbers were in sequence but some thousands of rows had been dropped and replaced by a copy of other rows. But that may be specific to Hive-on-TEZ. – Samson Scharfrichter May 28 '16 at 11:32

2 Answers2

36

ROW_NUMBER() is a windowing function so it needs to be used in conjunction with an OVER clause. Just don't specify any PARTITION.

SELECT *, ROW_NUMBER() OVER () AS row_num
FROM emp
--- other stuff
o-90
  • 17,045
  • 10
  • 39
  • 63
  • 11
    appears to need an explicit ordering at least in my version of hiveql, e.g. `SELECT *, ROW_NUMBER() OVER (ORDER BY some_emp_field) AS row_num FROM emp` – patricksurry Nov 02 '17 at 12:27
16

row_number() can be used to find for example, recent visit of a user on your site.

SELECT user_id,user_name,timestamp
FROM (
SELECT user_id,user_name,timestamp,row_number() over (partition by userid order by timestamp desc) as visit_number 
from user) user_table
    WHERE visit_number = 1
sumitya
  • 2,631
  • 1
  • 19
  • 32