
I would like to generate unique 64-bit keys (pseudo)-randomly to identify objects in our model. I need the keys to be as unique as possible (minimize the probability of collisions when any N keys are used together) across all users of the system.

Usual GUIDs are out of the question for now because we're data-cheap :). Since I don't foresee needing more than 1 million keys used in the same context, I would think 64 bits is enough (the birthday-bound collision probability works out to roughly 3×10⁻⁸).
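For concreteness, the collision figure comes straight from the birthday approximation p ≈ n(n−1)/(2·2⁶⁴); a quick Python sketch (the function name is just illustrative):

```python
def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday-bound approximation: probability that at least one
    collision occurs among n uniformly random `bits`-bit keys."""
    return n * (n - 1) / (2 * 2 ** bits)

p = collision_probability(1_000_000)
print(f"{p:.2e}")  # roughly 2.7e-08 for a million 64-bit keys
```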

As a side-note, I will also need a scheme to fold/hash tuples of those keys into a single 64-bit key that also needs to be well distributed/unique.
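The tuple-folding I have in mind might look like this (a Python sketch; MD5 and the fixed-width packing are my own illustrative choices, not a settled design):

```python
import hashlib
import struct

def fold_keys(*keys: int) -> int:
    """Hash a tuple of 64-bit keys into a single 64-bit key.
    Fixed-width big-endian packing removes any ambiguity about
    where one key ends and the next begins."""
    payload = b"".join(struct.pack(">Q", k) for k in keys)
    digest = hashlib.md5(payload).digest()
    return int.from_bytes(digest[:8], "big")
```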

Since I need a good (well distributed) hashing function anyway, would it be ok to fold a GUID in half (maybe accounting for the fixed bits in a GUID in some way)? Or is it better to use a local RNG? How would I seed the RNG to maximize uniqueness across space/time of generation? How strong a RNG would I need?
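To make the "fold a GUID in half" idea concrete, here is one sketch (Python, purely illustrative): XOR-fold a version-4 GUID. The fixed version bits sit in the high half and the fixed variant bits in the low half, at different positions, so every output bit still XORs in at least one uniformly random bit:

```python
import uuid

def folded_guid_key() -> int:
    """Fold a random (version-4) GUID into 64 bits by XOR-ing its halves.
    A v4 UUID has 122 random bits; the 6 fixed version/variant bits land
    at non-overlapping positions in the two halves, so the XOR-folded
    result is still uniform over 64 bits."""
    g = uuid.uuid4().int  # 128-bit integer
    return (g >> 64) ^ (g & 0xFFFFFFFFFFFFFFFF)
```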

I'm not particularly looking for efficiency (up to a point), but I'd really like to ensure that the probabilities hold up to their promise!

fparadis2

1 Answer


Hash a counter using a fast 128-bit cryptographic hash like MD5, then split the digest in two. That will give you "random", independent values of 64 bits each, and it should be pretty efficient.
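A minimal Python sketch of that scheme (names are illustrative; each counter value yields two 64-bit keys):

```python
import hashlib
import itertools

def key_stream():
    """Yield 64-bit keys: MD5 each counter value and split the
    128-bit digest into two independent 64-bit halves."""
    for n in itertools.count():
        digest = hashlib.md5(str(n).encode()).digest()
        yield int.from_bytes(digest[:8], "big")
        yield int.from_bytes(digest[8:], "big")

gen = key_stream()
first_four = [next(gen) for _ in range(4)]
```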

And are you sure you can't use a simple counter?

Update: if you need a distributed solution, simply place a counter on each machine and hash the machine's MAC address plus the counter. For better throughput, use multiple counters per machine, each with a different name (A, B, etc.), and hash the name too. This is the big advantage of using hashes: you can throw anything in there. Just be careful not to introduce ambiguities (for example, put "-" between each item you hash, so that a name of "1" plus a count of "23" is not confused with a name of "12" and a count of "3").
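As a hedged Python sketch of the distributed variant (MD5 and `uuid.getnode()` stand in for "hash" and "the MAC address"; note that `uuid.getnode()` may fall back to a random value on machines where no MAC can be read):

```python
import hashlib
import itertools
import uuid

def machine_key_stream(counter_name: str):
    """Per-machine, per-counter stream of 64-bit keys.
    The "-" separators keep ("1", 23) distinct from ("12", 3)."""
    mac = format(uuid.getnode(), "012x")  # best-effort MAC address
    for n in itertools.count():
        material = f"{mac}-{counter_name}-{n}".encode()
        digest = hashlib.md5(material).digest()
        yield int.from_bytes(digest[:8], "big")
        yield int.from_bytes(digest[8:], "big")
```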

andrew cooke
  • We currently have a server that distributes ranges of 32-bit keys and guarantees uniqueness, but this kind of simple counter is exactly what I'm trying to avoid (single point of failure, connectivity issues). – fparadis2 Aug 30 '11 at 13:21
  • You can use a previous ID instead of counter. This way you can have a lot of independent "counters" because each MD5 gives you two. – Rotsor Aug 30 '11 at 13:27
  • Or have prefixes for different sources. The choices are endless once you use hashes. For example, each machine can hash its MAC address plus a counter, etc. Updated the answer. – andrew cooke Aug 30 '11 at 13:33
  • So, md5 a local counter + something that identifies the machine? How would that compare to using a RNG (Mersenne twister, etc)? – fparadis2 Aug 30 '11 at 13:37
  • A PRNG might be slightly faster, but you need to worry more about the implementation: it needs to have sufficient state and a long enough period. Hashes already have the properties you need as simple guarantees, without your needing to investigate exactly how they are implemented. – andrew cooke Aug 30 '11 at 13:41
  • I like this answer and upvoted it, but it would probably be much simpler and fast enough just to use /dev/urandom or the Windows equivalent depending on the platform. – President James K. Polk Aug 30 '11 at 15:52
  • oh. for some reason i thought that was excluded by the question. but yes, i agree. post that as an answer - i was adding it to mine then thought it seemed unfair (also, i am not entirely sure how urandom works). – andrew cooke Aug 30 '11 at 16:02
  • What's the equivalent of urandom on Windows? – fparadis2 Aug 30 '11 at 18:11
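(For reference on that last question: Windows exposes its CSPRNG through the CryptGenRandom/BCryptGenRandom facility, and Python's `os.urandom` abstracts over both that and `/dev/urandom`, so a portable sketch is:)

```python
import os

def random_key() -> int:
    """64 random bits from the OS CSPRNG: /dev/urandom on Unix,
    the CryptGenRandom/BCryptGenRandom facility on Windows."""
    return int.from_bytes(os.urandom(8), "big")
```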