0

I've implemented the Adler-32 rolling hash in PHP, but because ord() is so slow at getting the integer values of the characters in a string (about 1MB per second on my dev machine), this solution is unworkable for 100MB+ files.

PHP's mhash function can calculate an Adler-32 very quickly (120MB per second on my dev machine). However, mhash doesn't seem to support the rolling nature of Adler-32, so you have to calculate a whole new Adler-32 each time the rolling window moves, rather than just updating the hash for the two bytes which have actually changed.

I'm not tied to the Adler-32 algorithm; I just need a very fast rolling hash in PHP.

Dom

2 Answers

1

Let A be the low two bytes and B the high two bytes of the Adler-32 of the sequence {x1, x2, ..., xn}.

To get the Adler-32 of {x2, ..., xn}, subtract x1 from A, modulo 65521, and subtract n * x1 + 1 from B, again modulo 65521.
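
To see where those two subtractions come from, it may help to write the definitions out. The primed symbols A' and B' (the values for the shortened window) and A'', B'' (after appending the next byte) are notation introduced here for illustration only:

```latex
\begin{align*}
A   &= \Bigl(1 + \sum_{i=1}^{n} x_i\Bigr) \bmod 65521 \\
B   &= \Bigl(n + \sum_{i=1}^{n} (n - i + 1)\,x_i\Bigr) \bmod 65521 \\
A'  &= \Bigl(1 + \sum_{i=2}^{n} x_i\Bigr) \bmod 65521 = (A - x_1) \bmod 65521 \\
B'  &= \Bigl((n - 1) + \sum_{i=2}^{n} (n - i + 1)\,x_i\Bigr) \bmod 65521 = (B - n\,x_1 - 1) \bmod 65521 \\
A'' &= (A' + x_{n+1}) \bmod 65521 \\
B'' &= (B' + A'') \bmod 65521
\end{align*}
```

The last two lines are the ordinary Adler-32 step for appending one byte, so sliding the window by one position is the removal step followed by one append.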

Note that if your window size n happens to be a multiple of 65521, then you can just subtract one from B (modulo 65521). So that might be a good window size to pick, if you can. Also note that if n is larger than 65521, then you can multiply x1 by (n modulo 65521) instead. If n is a constant, you can do that modulo operation ahead of time.

(Note that the % operator in C and PHP is not the modulo operation, but rather the remainder operation, so its result can be negative. You need to take care with negative numbers.)
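
A minimal sketch of that care in PHP follows. The helper names `mod_floor` and `adler32_roll` are hypothetical (they are not PHP built-ins), and the sketch assumes a fixed window length `$n` and byte values in the range 0..255:

```php
<?php
// Hypothetical helper (not a PHP built-in): true modulo, result always in 0..$m-1.
function mod_floor(int $x, int $m): int
{
    return (($x % $m) + $m) % $m;
}

// Hypothetical helper: slide an Adler-32 window of fixed length $n by one byte,
// dropping byte value $old from the front and appending byte value $new at the back.
function adler32_roll(int $a, int $b, int $n, int $old, int $new): array
{
    $mod = 65521;
    // Remove $old, then append $new.
    $a = mod_floor($a - $old + $new, $mod);
    // Reduce $n first so the product stays small even for very large windows.
    $b = mod_floor($b - ($n % $mod) * $old - 1 + $a, $mod);
    return [$a, $b];
}

var_dump(-5 % 65521);           // int(-5): the remainder keeps the sign of the dividend
var_dump(mod_floor(-5, 65521)); // int(65516)
```

If the window length is constant, `$n % 65521` can be computed once outside the loop, as suggested above.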

Mark Adler
  • Hi Mark, thanks so much for writing an answer. However, my problem was not in the implementation of your(?) algorithm (I've already done that), but in implementing it quickly in PHP. Getting the bytes from a PHP string to operate on seems to be a slow process, about 1MB per second. The internal mhash implementation of adler32 clearly reads the bytes from the string about three orders of magnitude quicker, but it doesn't provide any way to make use of the rolling nature of the algorithm. – Dom Jun 15 '15 at 08:34
  • Plus one for the very helpful tips about the modulo. – Dom Jun 15 '15 at 08:35
  • My answer solves the issue you stated: "so you have to calculate a whole new adler32 as the rolling window moves rather than just recalculate the hash for the two bytes which have actually changed." You do not have to calculate a new Adler-32 for the whole window, but rather just update it for the bytes dropped off the start and added to the end. I can't do anything about PHP's internal implementations. – Mark Adler Jun 15 '15 at 13:27
  • I stated that I've implemented the algorithm but ord is too slow to make it workable. I then went on to say that the mhash implementation is much faster but not rolling. Your solution just pushes me right back to ord, which as I stated at the start is too slow. I really do appreciate your help, but your solution does not solve the stated problem. – Dom Jun 15 '15 at 18:04
0

With unpack you can get the integer values of all the characters in a string as an array in a single call. Note that the index of the returned array starts from 1, not 0.

Example:

$contents = "addadda";
$ords = array_values(unpack("C*", $contents)); // re-index to a 0-based array
$a = 1; $b = 0; // hash low and high words
$len = 4; // the window length
foreach ($ords as $i => $ord) {
    if ($i < $len) {
        $a = ($a + $ord) % 65521;
        $b = ($b + $a) % 65521;
    } else {
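        $removed = $ords[$i - $len]; // byte leaving the window
        $a = ($a + $ord - $removed + 65521) % 65521;
        // Reduce $len * $removed first so the sum cannot go negative for large windows.
        $b = ($b + $a - 1 - ($len * $removed) % 65521 + 65521) % 65521;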
    }
    if ($i >= $len - 1) {
        echo $i - $len + 1, "..", $i, ": ",
            substr($contents, $i - $len + 1, $len), " => ",
            $b * 65536 + $a, "\n";
    }
}

Result:

0..3: adda => 64815499
1..4: ddad => 65405326
2..5: dadd => 65208718
3..6: adda => 64815499
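
As a sanity check, the rolled values can be compared with an Adler-32 recomputed from scratch for each window using PHP's built-in `hash()` function. This is only a sketch, assuming the hash extension exposes the `adler32` algorithm (standard builds do). It recomputes every window, so it is meant purely for verifying the rolling code on small inputs, not for large files:

```php
<?php
// Assumption: 'adler32' is listed in hash_algos() on this PHP build.
$contents = "addadda";
$len = 4; // the window length

for ($i = 0; $i + $len <= strlen($contents); $i++) {
    $window = substr($contents, $i, $len);
    // hash() returns 8 hex digits; hexdec() converts them to an integer.
    $value = hexdec(hash('adler32', $window));
    echo $i, "..", $i + $len - 1, ": ", $window, " => ", $value, "\n";
}
```

If the rolling update is correct, this should print the same four values as the result above.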