
Most applications, especially databases, can sort and filter by small integers or floats much faster than they can do string comparisons.

Therefore I'm wondering if there is a hashing function that I can use to turn a short string (about 5 - 40 characters) into a 32-bit or 64-bit number, so that I can compare by integer instead of by string.

I first thought of crc32, but its 32-bit output seems much too small: collisions would become possible with fewer than 50,000 hashes, and I need to hash over a million strings.

I'm mostly interested in working in Python, PHP, V8 Javascript, PostgreSQL, and MySQL.

Xeoncross

1 Answer


The problem that collisions become likely at 50k entries is inherent in all 32-bit hashes. If you read a bit about the birthday problem, you'll see that collisions become likely once you have around sqrt(HashSpace) elements, e.g. sqrt(2^32) = 65,536 (64k) for 32-bit hashes.


With 64-bit hashes collisions become much rarer, but I still wouldn't feel too comfortable betting the correctness of my program on that.

Using an approximation from Wikipedia, we obtain a collision probability of about 3*10^-8 for 1 million elements, and about 3*10^-6 for 10 million elements.
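These figures can be checked with a small sketch. It assumes the usual birthday-problem approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)) and uniformly distributed hash values; the function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n random hashes.

    Assumes hash values are uniformly distributed over 2**bits outputs.
    """
    space = 2.0 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (50_000, 1_000_000, 10_000_000):
        p32 = collision_probability(n, 32)
        p64 = collision_probability(n, 64)
        print(f"n={n:>10,}  32-bit: {p32:.3g}  64-bit: {p64:.3g}")
```

For 32-bit hashes the probability is already substantial at 50,000 elements, while for 64-bit hashes it stays in the range quoted above.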

You could use CRC64 for that, or just truncate a crypto hash such as MD5 or SHA-1 to the desired length.
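A minimal sketch of the truncation approach in Python, assuming UTF-8 input and that the first 8 bytes of a SHA-1 digest are taken as the 64-bit value. The name `hash64` is hypothetical, and interpreting the bytes as a signed integer is a choice made here so the result fits a signed 64-bit column such as BIGINT:

```python
import hashlib


def hash64(text: str) -> int:
    """Map a string to a signed 64-bit integer by truncating SHA-1."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # Keep only the first 8 of the 20 digest bytes (64 bits), big-endian.
    return int.from_bytes(digest[:8], "big", signed=True)


if __name__ == "__main__":
    for word in ("apple", "banana", "apple"):
        print(word, hash64(word))
```

The same string always yields the same number, so the value can be computed independently in each language as long as the encoding, byte order, and signedness match.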


If a malicious person can choose the strings, breaking your program by deliberately creating collisions, you should at least switch to a keyed hash, such as HMAC.
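One way this could look in Python is HMAC-SHA256 truncated to 64 bits. This is a sketch under assumptions: `SECRET_KEY` and `keyed_hash64` are placeholder names, and the key shown is a dummy value that would need to be replaced by a randomly generated secret kept out of the attacker's reach.

```python
import hashlib
import hmac

# Placeholder only: use a long random secret in practice.
SECRET_KEY = b"replace-with-a-random-secret"


def keyed_hash64(text: str, key: bytes = SECRET_KEY) -> int:
    """Map a string to a signed 64-bit integer using a keyed hash (HMAC)."""
    mac = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    # Truncate the 32-byte MAC to its first 8 bytes.
    return int.from_bytes(mac[:8], "big", signed=True)


if __name__ == "__main__":
    print(keyed_hash64("apple"))
```

Without knowing the key, an attacker cannot predict which strings collide; accidental collisions remain as likely as with any other 64-bit hash.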


Depending on what you're doing, you could also simply create an in-memory mapping between string and int where you simply increment a counter for each element you encounter. This gives you a perfect mapping without risk for collisions, but is only applicable in some scenarios.
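The counter-based mapping can be sketched as follows; the class name `StringInterner` and the choice to start ids at 1 are illustrative assumptions.

```python
class StringInterner:
    """Assign each distinct string the next integer from a counter."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def id_for(self, text: str) -> int:
        # First sighting: store the next counter value; otherwise reuse it.
        return self._ids.setdefault(text, len(self._ids) + 1)


if __name__ == "__main__":
    ids = StringInterner()
    for word in ("apple", "banana", "apple"):
        print(word, ids.id_for(word))
```

The ids depend on the order in which strings are first seen, so the mapping has to be kept around (or stored) for a string to map to the same number later.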

CodesInChaos
  • A 0.000003 probability of a collision with 10 million elements? Sounds like it's worth trying, to see whether I come up with any collisions. I found [this *untested* crc64 PHP function](http://www.php.net/manual/en/function.crc32.php#106216) which might work. I would use a counter to increment a number manually, but the only input I have is the word, which I need to convert to the same number each time. I guess I could do a lookup of the word=number and *then use the number*. – Xeoncross Mar 16 '12 at 20:39