9

I have a string in Python. I calculate the SHA1 hash of that string with hashlib. I convert it to its hexadecimal representation and take the last 16 characters to use as an identifier:

import hashlib

hash_str = "foobarbazάλφαβήταγάμμα..."
hash_obj = hashlib.sha1(hash_str.encode('utf-8'))
hash_id  = hash_obj.hexdigest()[-16:]  # last 16 hex characters

My goal is an identifier of reasonable length that is unlikely to yield the same hash_id value for a different hash_str input.

If the probability of a SHA1 collision is 1/(2^160), or 1/(16^40), then if I take the last sixteen characters of the hex representation, is the probability of a collision only 1/(16^16)? Or are the bytes (or their hex equivalent) not distributed evenly?

Alex Reynolds
  • 3
    If sha1 is uniformly distributed, then also its "digits". Since sha1 was constructed to be a secure hash function, it should be uniformly distributed or at least very close to it (so close that you cannot see the difference). – AbcAeffchen Nov 06 '15 at 00:22
  • That's not the actual probability of a collision; the real one is much higher. To see why, search for "birthday paradox". – Pablo Fernandez Dec 27 '19 at 03:31
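
To put rough numbers on the birthday-paradox point, here is a minimal sketch, assuming the 16 hex characters behave like a uniformly random 64-bit value. The function name `collision_probability` and the sample ID counts are illustrative and not from the question. It uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n identifiers drawn from N possible values.

```python
import math

def collision_probability(num_ids: int, bits: int = 64) -> float:
    """Approximate chance that at least two of num_ids identifiers collide,
    assuming each identifier is a uniformly random value of `bits` bits."""
    space = 2.0 ** bits  # number of possible identifiers (16**16 for 16 hex chars)
    # Birthday approximation; expm1 keeps precision when the result is tiny.
    return -math.expm1(-num_ids * (num_ids - 1) / (2.0 * space))

# Hypothetical collection sizes, chosen only for illustration.
for n in (2, 10**6, 10**9):
    print(n, collision_probability(n))
```

With two identifiers this reduces to about 1/(16^16), the figure in the question; the probability that any pair collides grows roughly with the square of the number of identifiers stored.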

1 Answer

6

Yes. Any hash function that exhibits uniformity has an equal chance of producing any value in its output range for a randomly chosen input. Therefore, each value of the truncated hash is equally likely too. SHA-1 is a hash function that demonstrates uniformity, so your conjecture is true.
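
One informal way to look at this is to count how often each hex digit appears in the truncated identifiers. The following is a rough sketch, not a proof of uniformity: the helper `short_id`, the made-up inputs `input-0`, `input-1`, ... and the sample size are all assumptions for illustration.

```python
import hashlib
from collections import Counter

def short_id(text: str, length: int = 16) -> str:
    """Return the last `length` hex characters of the SHA-1 digest of text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[-length:]

num_inputs = 20_000  # arbitrary sample size
counts = Counter()
for i in range(num_inputs):
    # Counter.update on a string counts each character separately.
    counts.update(short_id(f"input-{i}"))

# 16 characters per ID spread over 16 possible hex digits.
expected = num_inputs * 16 / 16
for digit in sorted(counts):
    print(digit, counts[digit], round(counts[digit] / expected, 3))
```

If the digits are close to evenly distributed, each ratio printed in the last column should sit near 1. This only checks per-digit frequencies over one set of inputs, so it supports the claim above without establishing it.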

abligh