1

I need to generate and store an identifier per row in a distributed database (high write throughput). There are constraints on the length of the ID, and I'd prefer it to be as short as possible. The ID must be valid UTF-8.

I was considering generating a UUIDv4, taking its base16 (hex) representation, removing the hyphens, and keeping only a leading subset of the characters. If we need more characters in the future, we would take a larger subset.

e.g. Uuid = 123e4567-e89b-12d3-a456-426655440000

Subset = 123e4567e89b

Are there foreseeable issues with this?

  • 2
    Dunno. Imagine we printed a phone book using "partial people uuids" - everyone in there's gonna be listed just by their first name. Can we foresee any issues already? – CBroe Jul 08 '18 at 02:50
  • 1
    You are taking the timestamp fields of a v4 uuid. Timestamps are susceptible to systematic collision. – Raymond Chen Jul 08 '18 at 03:05
  • 1
    You can take whatever you want as your key. But the shorter it is, the higher is the possibility of collisions. Especially if the keys are generated in a distributed system. – derpirscher Jul 08 '18 at 03:33
  • Depending on the quality of the random generator, the uuid may not be evenly distributed, which may increase the chances of collisions too. – derpirscher Jul 08 '18 at 03:48
  • @CBroe you're assuming the first part of a UUID is drawn from a smaller subset of possibilities. As I understand it, 122 bits of a UUIDv4 are pseudo-random and 6 bits are invariant (https://en.m.wikipedia.org/wiki/Universally_unique_identifier?wprov=sfla1), so using fewer than 128 bits results in a higher probability of collision. I'm just trying to understand the consequences of using a UUIDv4 vs. rolling my own – rickyrattlesnake Jul 08 '18 at 06:16

2 Answers

1

You cannot guarantee that partial UUIDs will be universally unique. Depending on the number of UUIDs generated, this might not be an issue, especially if you check for duplicates, but perhaps it's better just to write your own ID generator with the length specification that you need. I suppose the actual specification for UUIDs requires a certain number of bits for each to be deemed universally unique, but your requirements limit length. They do not require the use of actual UUIDs.
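
To illustrate the "check for duplicates" idea, here is a minimal sketch, not a definitive implementation. It assumes the ID column has a uniqueness constraint and that the driver raises an error on a duplicate key. An in-memory SQLite database stands in for the distributed database, and the table name `items`, the function `insert_with_retry`, and its parameters are all made up for the example.

```python
import secrets
import sqlite3

def insert_with_retry(conn, payload, id_bytes=6, max_attempts=5):
    """Insert payload under a fresh random ID, retrying if the ID is taken."""
    for _ in range(max_attempts):
        row_id = secrets.token_hex(id_bytes)  # 2 hex characters per byte
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO items (id, payload) VALUES (?, ?)",
                    (row_id, payload),
                )
            return row_id
        except sqlite3.IntegrityError:
            continue  # primary-key collision: draw a new ID and try again
    raise RuntimeError("could not find a free ID")

# Stand-in for the real database (assumption: the ID is the primary key).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)")
print(insert_with_retry(conn, "hello"))
```

The retry loop only costs anything when a collision actually happens, so the shorter the ID, the more often the extra round trip is paid.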

brianolive
  • Yeah, this makes sense. I do have the ability to check uniqueness using the insert operation on the DB, but the fewer collisions, the faster the insert. I only need uniqueness on the order of 10^6. Random number generators need a seed, so I'm guessing a timestamp to the millisecond would be good enough; I'm not expecting that many writes per second – rickyrattlesnake Jul 08 '18 at 06:08
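
For the scale mentioned in this comment (about 10^6 rows), the collision risk can be estimated with the usual birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2·2^b)) for n IDs of b random bits. The sketch below assumes the IDs are uniformly random; the function name `collision_probability` is made up for the example.

```python
import math

def collision_probability(n_ids, bits):
    """Birthday-bound approximation: chance that n_ids uniformly random
    values of the given bit length contain at least one duplicate."""
    exponent = n_ids * (n_ids - 1) / (2 * 2**bits)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), stable for tiny values

for bits in (36, 48, 60, 72):
    print(f"{bits} bits: {collision_probability(10**6, bits):.2e}")
```

With 48 random bits (12 hex characters) and 10^6 rows, this comes out to roughly a 0.2% chance of at least one collision across the whole table, which is tolerable only if inserts check for duplicates.
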
0

If your field must be text and length matters, then using base16 only gives you 4 bits per byte whereas base64 gives 6 bits per byte. In other words, the former needs 50% more bytes to achieve the same collision probability as the latter. You could get to ~7 bits per byte by taking advantage of how UTF-8 works, but that's a lot more work (and risk) for a lot less gain.
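
As a quick check of the 50% figure, this sketch counts how many characters each encoding needs for the same number of random bits. The helper name `chars_needed` is made up for the example, and padding characters are ignored.

```python
import math

def chars_needed(bits, bits_per_char):
    """Characters required to carry `bits` of data at a given density."""
    return math.ceil(bits / bits_per_char)

for bits in (48, 72, 96):
    hex_len = chars_needed(bits, 4)   # base16: 4 bits per character
    b64_len = chars_needed(bits, 6)   # base64: 6 bits per character
    print(f"{bits} bits: base16={hex_len} chars, base64={b64_len} chars")
```

For 48 bits that is 12 characters in base16 against 8 in base64.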

There is no point in using a truncated UUID, though; you have to use the whole thing or its anti-collision properties don't hold. If you just want a random string, especially when you have the ability to check for collisions, just generate a random number with the desired number of bits (preferably a multiple of 6) and then base64 encode it.
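
A minimal sketch of that approach, under some assumptions of mine: it uses Python's `secrets` module as the random source, the URL-safe base64 alphabet (so the ID is plain ASCII and therefore valid UTF-8), and a made-up function name `random_id`. Whole bytes are drawn, so the bit count must be a multiple of 8; multiples of 24 (48, 72, 96) are also multiples of 6 and leave no partially used character.

```python
import base64
import secrets

def random_id(bits=48):
    """Return a random ID carrying `bits` bits, URL-safe base64 encoded."""
    if bits <= 0 or bits % 8:
        raise ValueError("bits must be a positive multiple of 8")
    raw = secrets.token_bytes(bits // 8)  # cryptographically strong randomness
    # Strip '=' padding; it carries no information for a fixed-length ID.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

print(random_id())    # 8 characters for 48 bits
print(random_id(72))  # 12 characters for 72 bits
```

Growing the ID later only means passing a larger `bits` value; existing shorter IDs stay valid as long as the column allows mixed lengths.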

StephenS