
I have been trying to generate unique ids for each row of a table (30 million+ rows).

  • Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
  • The built-in UDFs rand() and hash(rand(), unixtime()) seem to generate collisions.

There has to be a simple way to generate row ids, and I was wondering if anyone has a solution.

  • My next step is just creating a Java MapReduce job to generate a real hash string with a secure random + host IP + current time as a seed. But I figured I'd ask here before doing it ;)
Mehraban
user1745713

8 Answers


Use the reflect UDF to generate UUIDs.

reflect("java.util.UUID", "randomUUID")
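For context, reflect here statically invokes java.util.UUID.randomUUID() and returns its string form for each row. A minimal plain-Java sketch of what that call executes (the class name is illustrative):

```java
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Equivalent of Hive's reflect("java.util.UUID", "randomUUID"):
        // a random (version 4) UUID rendered as a 36-character string.
        String id = UUID.randomUUID().toString();
        System.out.println(id);
    }
}
```

Each value carries 122 random bits from a strong PRNG, so collisions are negligible even at tens of millions of rows.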

Update (2019)

For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

create table customer (
  id bigint default surrogate_key(),
  name string, 
  city string, 
  primary key (id) disable novalidate
);

To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
Carter Shanklin
  • This function is blacklisted in hiveserver2. Is there a workaround to resolve this or is there an alternative method to do this ? – Vinay Kumar Jul 15 '16 at 11:42
  • @VinayKumar - as per the documentation for the reflect UDF - you could write your own custom UDF wrapper for randomUUID which returns the value. reflect is simply a helper to avoid you needing to do that any time you want to call some common Java method. – tswann Oct 28 '16 at 07:08

Not sure if this is all that helpful, but here goes...

Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.

When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.
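The idea above can be sketched in plain Java without a Hadoop dependency (the file name and contents are hypothetical): TextInputFormat hands each mapper the byte offset of the line as its key, so "file:offset" is unique per record.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetKeys {
    // Mimics TextInputFormat: the mapper key for each line is its byte
    // offset from the start of the file, so "<file>:<offset>" is unique.
    static List<String> rowKeys(String fileName, String contents) {
        List<String> keys = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n", -1)) {
            keys.add(fileName + ":" + offset);
            offset += line.getBytes().length + 1; // +1 for the newline
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical two-line input file.
        for (String k : rowKeys("part-00000", "alpha\nbeta")) {
            System.out.println(k); // part-00000:0, then part-00000:6
        }
    }
}
```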

jtravaglini

Elaborating on the answer by jtravaglini, there are two built-in Hive virtual columns, available since 0.8.0, that can be used to generate a unique identifier:

INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE

Use like this:

select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,  
...  
;  
...  
OK  
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342  
...

Or you can anonymize that with md5 or similar; here's a link to an md5 UDF: https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')

select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...
GreenGiant
user3287506

Use the ROW_NUMBER function to generate monotonically increasing integer ids. Note that an empty OVER () clause funnels every row through a single reducer, so this can be slow on very large tables.

select ROW_NUMBER() OVER () AS id from t1;

See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html

ruhong

reflect("java.util.UUID", "randomUUID")

I could not vote up the other one. I needed a pure binary version, so I used this:

unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))
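The same packing can be sketched in plain Java (the Hive expression above strips the dashes and unhexes the remaining 32 hex digits into 16 raw bytes):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class BinaryUuid {
    // A random UUID packed into its raw 16 bytes instead of a
    // 36-character string, saving more than half the storage.
    static byte[] uuidBytes() {
        UUID u = UUID.randomUUID();
        return ByteBuffer.allocate(16)
                .putLong(u.getMostSignificantBits())
                .putLong(u.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        System.out.println(uuidBytes().length); // 16
    }
}
```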

Steven Lowenthal

Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.

cabad

Write a custom Mapper that keeps a counter for every map task and, for each row, emits as the row ID the concatenation of the JobID (obtained from the MR API) and the current value of the counter. Increment the counter before the next row is examined.
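A minimal sketch of that scheme in plain Java (the task id string here is a hypothetical stand-in for what the MR context actually provides):

```java
public class CounterIds {
    private final String taskId; // stand-in for the id from the MR context
    private long counter = 0;

    CounterIds(String taskId) {
        this.taskId = taskId;
    }

    // Called once per input row: unique across tasks because task ids
    // differ, unique within a task because the counter increments.
    String nextRowId() {
        return taskId + "_" + counter++;
    }

    public static void main(String[] args) {
        CounterIds ids = new CounterIds("task_000007");
        System.out.println(ids.nextRowId()); // task_000007_0
        System.out.println(ids.nextRowId()); // task_000007_1
    }
}
```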


If you want to work with multiple mappers and a large dataset, try using this UDF: https://github.com/manojkumarvohra/hive-hilo

It uses ZooKeeper as a central repository to maintain the sequence state and generates unique, incrementing numeric values.
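The hi-lo idea behind that library can be sketched as follows; here an in-process atomic counter is a stand-in for the ZooKeeper round trip, and the block size is an illustrative assumption. Each task claims a block of ids centrally, then hands them out locally without further coordination:

```java
import java.util.concurrent.atomic.AtomicLong;

public class HiLo {
    static final long BLOCK_SIZE = 1000;
    // Stand-in for the central counter held in ZooKeeper: each task
    // atomically claims the next block of BLOCK_SIZE ids.
    static final AtomicLong centralHi = new AtomicLong(0);

    private long hi = -1;
    private long lo = BLOCK_SIZE; // forces a fetch on first use

    long nextId() {
        if (lo == BLOCK_SIZE) {               // block exhausted:
            hi = centralHi.getAndIncrement(); // one "round trip" per block
            lo = 0;
        }
        return hi * BLOCK_SIZE + lo++;
    }

    public static void main(String[] args) {
        HiLo a = new HiLo(), b = new HiLo(); // two independent tasks
        System.out.println(a.nextId()); // 0
        System.out.println(b.nextId()); // 1000
        System.out.println(a.nextId()); // 1
    }
}
```

Only one coordination call per thousand ids is needed, which is why this scales better than a per-row central sequence.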