11

I'm looking for a Perl string checksum function with the following properties:

  • Input: Unicode string of arbitrary length ($string)
  • Output: Unsigned integer ($hash), for which 0 <= $hash <= 2^32-1 holds (0 to 4294967295, matching the size of a 4-byte MySQL unsigned int)

Pseudo-code:

sub checksum {
    my $string = shift;
    my $hash;
    ... checksum logic goes here ...
    die unless ($hash >= 0);
    die unless ($hash <= 4_294_967_295);
    return $hash;
}

Ideally the checksum function should be quick to run and should generate values somewhat uniformly in the target space (0 .. 2^32-1) to avoid collisions. In this application random collisions are totally non-fatal, but obviously I want to avoid them to the extent that it is possible.

Given these requirements, what is the best way to solve this?

knorv
  • You want to avoid collisions with all possible strings, but only have a 4 billion possible digests? Why is using an integer important? How about just using something like MD5, even if you have to store the digest as a string? – brian d foy Dec 23 '09 at 02:20
  • 1
    "You want to avoid collisions with all possible strings" - No, as stated in the question I simply "want to avoid them to the extent that it is possible". – knorv Dec 23 '09 at 11:38
  • "Why is using an integer important?" - As stated in the question, the checksum will be stored in "a 4-byte MySQL unsigned int". – knorv Dec 23 '09 at 11:39

3 Answers

14

Any good hash function will be sufficient: simply truncate its output to 4 bytes and convert that to a number. Good hash functions distribute their output uniformly, and every 4-byte slice of the digest is distributed just as uniformly, so it does not matter where you truncate.

I suggest Digest::MD5 because it is fast and comes with Perl as a core module. String::CRC, as Pim mentions, is also implemented in C and should be faster still, though it has to be installed from CPAN.

Here's how to calculate the hash and convert it to an integer:

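use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);
# md5() croaks on wide characters, so encode the Unicode string to UTF-8 bytes first
my $str = substr( md5( encode_utf8("String-to-hash") ), 0, 4 );
print unpack('N', $str);  # Convert to an unsigned 32-bit integer (big-endian, same on every machine)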
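As a minimal sketch, the same idea can be wrapped in the `checksum` sub from the question. The UTF-8 encoding step is an assumption about how the Unicode input should be turned into bytes, and `'N'` is used so the result does not depend on the machine's byte order:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

# Sketch of the question's checksum() built on a truncated MD5 digest.
sub checksum {
    my ($string) = @_;
    # md5() croaks on wide characters, so hash the UTF-8 bytes instead.
    my $digest = md5( encode_utf8($string) );
    # 'N' reads the first 4 bytes of the digest as a big-endian
    # unsigned 32-bit integer; the remaining 12 bytes are ignored.
    my $hash = unpack 'N', $digest;
    die unless $hash >= 0 && $hash <= 4_294_967_295;
    return $hash;
}

print checksum("String-to-hash"), "\n";
print checksum("\x{263A} smiley"), "\n";   # wide-character input works too
```

The value is stable across runs and machines, which matters if it is stored in a database column and compared later.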
rjh
  • 4
    B::hash also comes with core perl, uses the internal core hash function, is faster than MD5 and returns a hexified 32-bit integer. But it is not as secure as MD5. – rurban Apr 08 '15 at 08:00
5

From perldoc -f unpack:

        For example, the following computes the same number as the
        System V sum program:

            $checksum = do {
                local $/;  # slurp!
                unpack("%32W*",<>) % 65535;
            };
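The perldoc example reads from a file handle. A sketch of the same checksum applied to a string already in memory might look like this (`sysv_sum` is a hypothetical name). Note that the result only ranges over 0 .. 65534, and that a plain sum ignores character order, so anagrams collide:

```perl
use strict;
use warnings;

# Hypothetical helper: System V style checksum of an in-memory string.
sub sysv_sum {
    my ($string) = @_;
    # %32 asks unpack for a 32-bit sum of the unpacked values;
    # W* unpacks every character as its (possibly > 255) code point.
    return unpack('%32W*', $string) % 65535;
}

print sysv_sum("ab"), "\n";   # 97 + 98 = 195
print sysv_sum("ba"), "\n";   # also 195: character order is ignored
```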
Randal Schwartz
  • This 32-bit sum of all bits is a very bad hash value for random distributions. Any hash function is better, even the simplest ones. – rurban Apr 08 '15 at 08:04
  • Sure, but that's the same problem that the System V `sum` program has. See the paragraph. Or are you arguing that `sum` is arguably broken? In that case, it's not about Perl. – Randal Schwartz Apr 09 '15 at 02:29
  • `sum` is about as quick as you'll get, though as noted above, it isn't terribly robust. You can improve it slightly by using the size, e.g. `$_ = <>; unpack("%32W*",$_)%65535 . length($_)`. Anything that needs to be more robust should use `Digest::MD5` or `Digest::SHA`, etc. – Adam Katz Sep 24 '15 at 18:06
4

I don't know how quick it is, but you might try String::CRC.
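The answer does not show String::CRC's interface, so the sketch below swaps in a different CRC implementation: the `crc32` function from the core module Compress::Zlib. It may not produce the same values as String::CRC; it only illustrates that a 32-bit CRC already fits the required range. The `crc_checksum` name and the UTF-8 encoding step are assumptions:

```perl
use strict;
use warnings;
use Compress::Zlib qw(crc32);
use Encode qw(encode_utf8);

# Hypothetical helper: zlib CRC-32 of the string's UTF-8 bytes.
sub crc_checksum {
    my ($string) = @_;
    # Encode first so that wide characters are handled as bytes.
    return crc32( encode_utf8($string) );
}

printf "%u\n", crc_checksum("String-to-hash");
```

A CRC is designed for error detection rather than uniform hashing, so it is worth checking the collision rate on real data before relying on it.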

brian d foy
Pim