
Take this hash for example:

ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad

It's too long for my purposes so I intend to use a small chunk from it, such as:

ba7816bf8f01
ba7816bf

Or similar. My intended use case:

  • Video gallery on a website, represented by thumbnails. They are in random order.
  • They play in the lightbox. They don't have a unique ID, only their URL is unique.
  • While the lightbox is open I add something to the end of the page URL with JS History API.

//example.com/video-gallery/lightbox/ba7816bf8f01

  • The suffix needs to be short and simple, definitely not a URL.
  • People share the URL.
  • The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
  • Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.

I thought I'd SHA256 the URL of the video, use the first few characters as an ad-hoc ID. How many characters should I use from the generated hash, to considerably reduce the chance of collision?

I got the idea from URLs and Hashing by Google.
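A minimal sketch of the scheme described above, assuming the server knows every video URL in the gallery and can rebuild a lookup table from short ID back to URL. The names `short_id` and `build_index` and the example URLs are hypothetical, for illustration only. Incidentally, the hash shown at the top is the standard SHA-256 test vector for the string "abc".

```python
import hashlib


def short_id(video_url: str, hex_chars: int = 12) -> str:
    """Return the first `hex_chars` hex digits of SHA-256(video_url)."""
    digest = hashlib.sha256(video_url.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def build_index(video_urls, hex_chars: int = 12) -> dict:
    """Map each short ID back to its URL, failing loudly on a collision."""
    index = {}
    for url in video_urls:
        key = short_id(url, hex_chars)
        if key in index and index[key] != url:
            raise ValueError(f"collision between {index[key]!r} and {url!r}")
        index[key] = url
    return index


# Hypothetical gallery URLs, not taken from the question.
urls = [
    "https://example.com/videos/a.mp4",
    "https://example.com/videos/b.mp4",
]
index = build_index(urls)
suffix = short_id(urls[0])
print(suffix, "->", index[suffix])
```

Because the whole gallery is known up front, a collision can be detected when the index is built rather than discovered by a visitor, so a fairly short prefix may be acceptable in practice.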

  • you could use a function that produces a shorter hash like md5... or you should be fine just truncating your sha256. You increase the chance of collision from almost impossible to slightly less almost impossible. – avigil Mar 11 '18 at 19:10
  • maybe also consider base64 encoding to increase the number of bits encoded in your id. I see this done all the time- look at image links on imgur – avigil Mar 11 '18 at 19:18
  • I've thought about that. Why is `base_convert("ba7816bf8f01", 16, 36);` shorter (when it uses fewer possible characters) than `base64_encode("ba7816bf8f01");`? – Firsh - justifiedgrid.com Mar 11 '18 at 19:47
  • that is the string representation of those bytes in hex. A hex character is 4 bits but the string representation of that character is 16 bits. You need to base64 encode the bytes, not the string representing those bytes – avigil Mar 11 '18 at 22:40
  • well... depends on what you are using. In Java a char is 16 bits, in C it's 8, but in any case it's more than the underlying bytes you are representing. – avigil Mar 11 '18 at 22:48
  • Consider base58 encoding instead of base64. It has a number of advantages when generating human-readable identifiers. – Nick Moore Nov 24 '22 at 21:07

1 Answer


The Wikipedia page on birthday attacks has a table showing how many entries it takes to reach a given chance of collision for a random identifier of a given number of bits. If you want to have a one in a million chance of a collision and expect to store a million documents, for example, you’ll need roughly 59 bits, so 64 bits (16 hex characters) is more than enough.
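The numbers in that table can be approximated with the usual birthday bound, p ≈ 1 − exp(−n(n−1) / 2^(b+1)), for n items and b random bits. This is a rough sketch that assumes the truncated hash behaves like a uniformly random value. The function names and the gallery size of 1,000 videos are made up for illustration.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / 2^(bits+1))."""
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) accurately for very small x.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


def bits_needed(n_items: int, max_probability: float) -> int:
    """Smallest whole number of bits that keeps the collision chance at or below the cap."""
    bits = 1
    while collision_probability(n_items, bits) > max_probability:
        bits += 1
    return bits


# One-in-a-million chance across a million documents, as in the answer.
print(bits_needed(10**6, 1e-6))

# Hypothetical gallery of 1,000 videos with 8 or 12 hex characters (32 or 48 bits).
for hex_chars in (8, 12):
    print(hex_chars, collision_probability(1000, hex_chars * 4))
```

Each hex character contributes 4 bits, so the result of `bits_needed` divided by 4 and rounded up gives the number of hex characters to keep.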

Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.
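To make the size difference concrete, here is a small sketch that encodes the same 48 bits three ways. It uses the URL-safe Base64 alphabet, which is my assumption rather than something the answer specifies, since the ID ends up in a URL path. The last line shows the mistake raised in the comments: Base64-encoding the hex string instead of the raw bytes.

```python
import base64
import hashlib

digest = hashlib.sha256(b"abc").digest()  # 32 raw bytes
first6 = digest[:6]                       # keep 48 bits

# Hex: 2 characters per byte.
hex_id = first6.hex()

# URL-safe Base64 of the raw bytes: 4 characters per 3 bytes.
# 6 bytes is a multiple of 3, so no "=" padding appears.
b64_id = base64.urlsafe_b64encode(first6).decode("ascii")

# Base64 of the hex *string* encodes 12 ASCII bytes, so it gets longer, not shorter.
b64_of_hex = base64.b64encode(hex_id.encode("ascii")).decode("ascii")

print(len(hex_id), hex_id)          # 12 characters
print(len(b64_id), b64_id)          # 8 characters
print(len(b64_of_hex), b64_of_hex)  # 16 characters
```

Picking a byte count that is a multiple of 3 avoids padding characters; other lengths would need the trailing `=` stripped and restored on decode.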

Ry-