I have a question: if we let the rolling hash overflow, does that affect the correctness of the Rabin-Karp algorithm? Could you give a concrete example where overflow actually breaks correctness?

That is, could the same string, e.g. "abcd", get different hash values depending on whether you compute it directly from "abcd" or roll it out of "eabcd" as (hash("eabc") - hash("e") * R^3) * R + hash("d")?

In other words, can we end up with hash("abcd") != (hash("eabc") - hash("e") * R^3) * R + hash("d") if we allow int/long overflow?

maplemaple
  • There isn't really a "standard" hash function to use with Rabin-Karp, and since your question is really about one specific hash function, you should specify which hash function that is. – Matt Timmermans Jul 15 '20 at 00:32

2 Answers

If the rolling hash uses unsigned integers, overflow simply wraps around, which is equivalent to reducing modulo 2^32 or 2^64, depending on the width of the unsigned type. So no, overflow does not affect correctness: the algorithm will still be correct. (As an exercise, think about why unsigned overflow is equivalent to taking a modulus.)

In fact, many fast implementations skip the explicit modulo operation and rely on unsigned overflow as an implicit modulo for speed; as an example, see the sample C implementation by Charras and Lecroq: https://www-igm.univ-mlv.fr/~lecroq/string/node5.html

Still, pseudocode presentations keep the modulo operation, because spelling it out makes the algorithm easier to understand and draws attention to the detail.
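To make this concrete, here is a minimal sketch in Python. Python integers never overflow, so the mask `MASK` emulates unsigned 64-bit wraparound; the radix `R = 1000003` and the names `direct_hash`, `roll`, and `rabin_karp` are my own choices for illustration, not taken from the linked implementation. With this radix, hashing "eabc" already exceeds 2^64 before masking, so the question's example really does exercise overflow.

```python
MASK = (1 << 64) - 1   # keeps the low 64 bits, like unsigned 64-bit wraparound
R = 1000003            # assumed radix, chosen large so 4 characters overflow


def direct_hash(s):
    """Polynomial hash of s computed from scratch, wrapping at 2^64."""
    h = 0
    for ch in s:
        h = (h * R + ord(ch)) & MASK
    return h


def roll(h, out_ch, in_ch, top):
    """Slide the window by one character; top must be R^(m-1) mod 2^64."""
    h = (h - ord(out_ch) * top) & MASK   # drop the leading character
    return (h * R + ord(in_ch)) & MASK   # shift and append the new one


def rabin_karp(text, pattern):
    """Return all start indices of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    top = pow(R, m - 1, MASK + 1)
    target = direct_hash(pattern)
    h = direct_hash(text[:m])
    hits = []
    for i in range(n - m + 1):
        # Compare the strings on a hash match to rule out collisions.
        if h == target and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            h = roll(h, text[i], text[i + m], top)
    return hits


# The question's example: rolling "eabc" -> "abcd" agrees with hashing "abcd".
rolled = roll(direct_hash("eabc"), "e", "d", pow(R, 3, MASK + 1))
print(rolled == direct_hash("abcd"))
print(rabin_karp("xxabcdxxabcd", "abcd"))
```

The explicit string comparison on a hash match is still needed, with or without overflow, because distinct strings can collide.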

BearAqua

I don't think it will affect the correctness of the algorithm: the hash is a deterministic function, so two equal inputs produce the same output. The rolling update only adds, subtracts, and multiplies, and wraparound arithmetic stays consistent under those operations, so each individual result is unaffected even if it overflows.
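A short derivation may help here. It assumes a polynomial hash with radix $R$ over a $w$-bit type, so that wraparound means working modulo $M = 2^w$; the symbols $c_i$, $m$, and $M$ are my notation, not the asker's.

```latex
H(c_0 \dots c_{m-1}) \equiv \sum_{i=0}^{m-1} c_i R^{m-1-i} \pmod{M}

\bigl(H(c_0 \dots c_{m-1}) - c_0 R^{m-1}\bigr) R + c_m
  \equiv \Bigl(\sum_{i=1}^{m-1} c_i R^{m-1-i}\Bigr) R + c_m
  \equiv \sum_{i=1}^{m} c_i R^{m-i}
  \equiv H(c_1 \dots c_m) \pmod{M}
```

Addition, subtraction, and multiplication all respect congruence modulo $M$, so reducing after every step (which is what wraparound does) gives the same residue as reducing once at the end. With $m = 4$ and the window moving from "eabc" to "abcd", this is exactly the identity in the question.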

Daniel
  • Then why does the standard Rabin-Karp require a modulo operation to prevent overflow? Frequent modulo operations are quite expensive. – maplemaple Jul 15 '20 at 00:05
  • Mostly because of a preference for non-negative numbers. Sometimes the hash values are used as array indexes, and you can't have negative indexes. It depends on the case. If the hash is used exclusively for string matching, allowing it to overflow seems acceptable. – Daniel Jul 15 '20 at 00:17
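To illustrate the point about negative values, here is a small sketch that emulates signed 32-bit wraparound (as a Java `int` would behave) in Python. The helper names `to_int32` and `signed_hash`, the radix 31, and the table size 101 are assumptions made for the example.

```python
def to_int32(x):
    """Interpret the low 32 bits of x as a signed two's-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def signed_hash(s, radix=31):
    """Polynomial hash with signed 32-bit wraparound after every step."""
    h = 0
    for ch in s:
        h = to_int32(h * radix + ord(ch))
    return h


h = signed_hash("zzzzzz")
print(h)            # negative: the sum passed 2^31 and wrapped around

TABLE_SIZE = 101    # hypothetical table size
bucket = h % TABLE_SIZE   # Python's % is already non-negative for a positive modulus
print(bucket)
```

Equality comparisons between hashes work fine with negative values, so string matching is unaffected. Only uses such as array indexing need a non-negative result; note that in C or Java the `%` operator can return a negative remainder, so an extra adjustment would be needed there.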