7

I've got a huge table with something like 8 300 000 rows (won't be edit nor delete ever).

My first column look something similar P300-4312B_X16_S and the entry isn't unique so I use a regular INDEX on this field.

However, MySQL is WAY faster using a binary field instead of a varchar so I encode my INDEX in MD5 using BINARY(16) to store the data.

This morning, I've started to use CRC32 for the first time and I saw that CRC32 can be output as an hexadecimal string using 8 characters.

My question : If I use a CRC32 instead of a MD5, it will be faster. However, when CRC32 is ran over let's say 2 000 000 unique value, the result will be unique or maybe sometime I'll have twice the same string for two differents string ? I ask that because the result is only 8 characters (32b) long instead of 32(128b) like the MD5.

Thanks.

David Bélanger
  • 7,400
  • 4
  • 37
  • 55
  • please take a look to this page: http://www.dslreports.com/forum/remark,13525942 – jcho360 Oct 01 '12 at 18:31
  • 1
    Of course you will get more collisions with CRC32. It is a tool for data integrity check, not a hash function like md5. Hash functions are designed to produce as little collisions (the same results for different input) as possible. CRC is not. – dmitry Oct 01 '12 at 18:33
  • `However, MySQL is WAY faster using a binary field instead of a varchar so I encode my INDEX in MD5 using BINARY(16) to store the data.` It sounds like your indexes are broken. Indexing over a `VARCHAR` should work fine.. – Brendan Long Oct 01 '12 at 22:17
  • For dmitry, getting more collisions with crc32 as compared to md5 has almost nothing to do with the design and everything to do with the number of bits. A crc32 will result in the same number of collisions as any other good 32-bit hash function. Similarly, a 128-bit crc will result in the same collision probability as md5. md5 has another design requirement beyond crc that it not be reversible for use in cryptographic applications. That property has no impact on accidental collisions. All it does is prevent, or rather make very difficult, manufactured collisions. – Mark Adler Oct 02 '12 at 06:14
  • @Mark Adler Can't agree with "nothing to do with design". Md5 is hash algorithm by design. Crc is check sum designed to detect bit errors and collisions in that context are just off of the domain. – dmitry Oct 02 '12 at 07:24
  • Why do you not just use a *_bin-collation? – Jimmy T. Mar 11 '14 at 19:45

1 Answers1

10

The expected number of collisions is the number of pairs over the number of possible check values. So for 2,000,000 values there are (2000000 * 1999999) / 2 pairs, which is about 2x1012. For a 32-bit CRC, the expected number of collisions is that over 232, which is 466. So you are essentially guaranteed to have collisions in that case.

For a 128-bit MD5 check value, the expected number of collisions is about 6x10-27. For small values of the expected number, that is also the probability of one collision.

If it is important to you to have a very low probability of a collision, then you need to pick something other than a CRC-32.

You don't need the overhead of MD5 though, where its cryptographic strength is unimportant for your application. You don't really care if someone malicious can find a way to fabricate an entry with the same check value as another entry. Therefore you could use a 64-bit non-cryptographic hash designed for that purpose, which would run much faster and would give a 10-7 probability of a collision in your case of 2,000,000 values. Or you could use a 128-bit non-cryptographic hash and get the same probability as for MD5, but much faster. Take a look at the CityHash family of hash algorithms.

Note however that in all cases the probability of a collision is not zero. You should consider the consequences of a collision to your code.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • I like your answer because I now understand the logic behind the "hash". I don't care if the visitor find the encoded hash, it's only to define a bus trip. If he find it then he will find a random bus trip... no big deal. I'll take a look at the CityHash family. Thanks. – David Bélanger Oct 02 '12 at 13:26