I have been trying to generate unique ids for each row of a table (30 million+ rows).
- using sequential numbers obviously does not work due to the parallel nature of Hadoop.
- the built-in UDFs rand() and hash(rand(), unix_timestamp()) seem to generate collisions.
There has to be a simple way to generate row ids, and I was wondering if anyone has a solution.
- my next step is just creating a Java MapReduce job to generate a real hash string with a secure random + host IP + current time as a seed, but I figured I'd ask here before doing it ;)
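
For concreteness, a rough sketch of what that fallback could look like in plain Java. The class name RowIdSketch and the method newRowId are placeholders made up for illustration, SHA-256 is just one possible digest, and the code is not wired into a mapper or a Hive UDF. Treat it as a sketch under those assumptions, not a tested solution.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper: hashes a secure random nonce, the host IP, and the
// current time into a hex string that can serve as a row id.
public class RowIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns the local host address, or a fixed marker if it cannot be resolved.
    static String hostAddress() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    // Builds one id: SHA-256 over (16 random bytes, host IP, current millis).
    static String newRowId() {
        byte[] nonce = new byte[16];
        RANDOM.nextBytes(nonce);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(nonce);
            md.update(hostAddress().getBytes(StandardCharsets.UTF_8));
            md.update(Long.toString(System.currentTimeMillis())
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString(); // 64 hex characters
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Print a few ids to eyeball the format.
        for (int i = 0; i < 3; i++) {
            System.out.println(newRowId());
        }
    }
}
```

The 128-bit random nonce does most of the work here; the host IP and timestamp only add extra separation between tasks running on different nodes.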