
I have been trying to generate unique ids for each row of a table (30 million+ rows).

  • Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
  • The built-in UDFs rand() and hash(rand(), unixtime()) seem to generate collisions.

There has to be a simple way to generate row ids, and I was wondering if anyone has a solution.

  • My next step is just creating a Java MapReduce job to generate a real hash string with a secure random + host IP + current time as a seed. But I figured I'd ask here before doing it ;)
Mehraban
user1745713

8 Answers


Use the reflect UDF to generate UUIDs.

reflect("java.util.UUID", "randomUUID")
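For context, reflect here statically invokes java.util.UUID.randomUUID() and returns its string form for each row. A minimal plain-Java sketch of what that call executes (the class name is illustrative):

```java
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Equivalent of Hive's reflect("java.util.UUID", "randomUUID"):
        // a random (version 4) UUID rendered as a 36-character string.
        String id = UUID.randomUUID().toString();
        System.out.println(id);
    }
}
```

Each value carries 122 random bits from a strong PRNG, so collisions are negligible even at tens of millions of rows.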

Update (2019)

For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

create table customer (
  id bigint default surrogate_key(),
  name string, 
  city string, 
  primary key (id) disable novalidate
);

To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
Carter Shanklin
  • This function is blacklisted in hiveserver2. Is there a workaround to resolve this or is there an alternative method to do this ? – Vinay Kumar Jul 15 '16 at 11:42
  • @VinayKumar - as per the documentation for the reflect UDF - you could write your own custom UDF wrapper for randomUUID which returns the value. reflect is simply a helper to avoid you needing to do that any time you want to call some common Java method. – tswann Oct 28 '16 at 07:08

Not sure if this is all that helpful, but here goes...

Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.

When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.
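The idea above can be sketched in plain Java without a Hadoop dependency (the file name and contents are hypothetical): TextInputFormat hands each mapper the byte offset of the line as its key, so "file:offset" is unique per record.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetKeys {
    // Mimics TextInputFormat: the mapper key for each line is its byte
    // offset from the start of the file, so "<file>:<offset>" is unique.
    static List<String> rowKeys(String fileName, String contents) {
        List<String> keys = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n", -1)) {
            keys.add(fileName + ":" + offset);
            offset += line.getBytes().length + 1; // +1 for the newline
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical two-line input file.
        for (String k : rowKeys("part-00000", "alpha\nbeta")) {
            System.out.println(k); // part-00000:0, then part-00000:6
        }
    }
}
```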

jtravaglini

Elaborating on the answer by jtravaglini, there are two built-in Hive virtual columns, available since 0.8.0, that can be used to generate a unique identifier:

INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE

Use like this:

select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,  
...  
;  
...  
OK  
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342  
...

Or you can anonymize that with md5 or similar; here's a link to an md5 UDF: https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')

select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...
GreenGiant
user3287506

Use the ROW_NUMBER function to generate monotonically increasing integer ids. Note that an empty OVER () clause funnels every row through a single reducer, so this can be slow on very large tables.

select ROW_NUMBER() OVER () AS id from t1;

See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html

ruhong

reflect("java.util.UUID", "randomUUID")

I could not vote up the other one. I needed a pure binary version, so I used this:

unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))
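The same packing can be sketched in plain Java (the Hive expression above strips the dashes and unhexes the remaining 32 hex digits into 16 raw bytes):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class BinaryUuid {
    // A random UUID packed into its raw 16 bytes instead of a
    // 36-character string, saving more than half the storage.
    static byte[] uuidBytes() {
        UUID u = UUID.randomUUID();
        return ByteBuffer.allocate(16)
                .putLong(u.getMostSignificantBits())
                .putLong(u.getLeastSignificantBits())
                .array();
    }

    public static void main(String[] args) {
        System.out.println(uuidBytes().length); // 16
    }
}
```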

Steven Lowenthal

Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.

cabad

Write a custom Mapper that keeps a counter for every map task and, for each row, emits as the row ID the concatenation of the JobID (obtained from the MR API) and the current value of the counter. Increment the counter before the next row is examined.
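A minimal sketch of that scheme in plain Java (the task id string here is a hypothetical stand-in for what the MR context actually provides):

```java
public class CounterIds {
    private final String taskId; // stand-in for the id from the MR context
    private long counter = 0;

    CounterIds(String taskId) {
        this.taskId = taskId;
    }

    // Called once per input row: unique across tasks because task ids
    // differ, unique within a task because the counter increments.
    String nextRowId() {
        return taskId + "_" + counter++;
    }

    public static void main(String[] args) {
        CounterIds ids = new CounterIds("task_000007");
        System.out.println(ids.nextRowId()); // task_000007_0
        System.out.println(ids.nextRowId()); // task_000007_1
    }
}
```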


If you want to work with multiple mappers and a large dataset, try using this UDF: https://github.com/manojkumarvohra/hive-hilo

It uses ZooKeeper as a central repository to maintain the sequence state and generates unique, incrementing numeric values.
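The hi-lo idea behind that library can be sketched as follows; here an in-process atomic counter is a stand-in for the ZooKeeper round trip, and the block size is an illustrative assumption. Each task claims a block of ids centrally, then hands them out locally without further coordination:

```java
import java.util.concurrent.atomic.AtomicLong;

public class HiLo {
    static final long BLOCK_SIZE = 1000;
    // Stand-in for the central counter held in ZooKeeper: each task
    // atomically claims the next block of BLOCK_SIZE ids.
    static final AtomicLong centralHi = new AtomicLong(0);

    private long hi = -1;
    private long lo = BLOCK_SIZE; // forces a fetch on first use

    long nextId() {
        if (lo == BLOCK_SIZE) {               // block exhausted:
            hi = centralHi.getAndIncrement(); // one "round trip" per block
            lo = 0;
        }
        return hi * BLOCK_SIZE + lo++;
    }

    public static void main(String[] args) {
        HiLo a = new HiLo(), b = new HiLo(); // two independent tasks
        System.out.println(a.nextId()); // 0
        System.out.println(b.nextId()); // 1000
        System.out.println(a.nextId()); // 1
    }
}
```

Only one coordination call per thousand ids is needed, which is why this scales better than a per-row central sequence.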