
I am trying to learn how to match a pattern in a given text string using multiple hashing. I found the following implementation in Java:

void multiHashing() {
    int counter = 0;
    int d = 26;
    int r = 10;
    int [] qP = qPrimes(d,r); // stores 10 prime numbers
    long [] P = new long[r];
    long [] T = new long[r];
    long [] H = new long[r];
    for (int k=0;k<r;k++) {
        H[k] = mod(((long) Math.pow(d, m-1)), qP[k]);
        for (int i=0;i<m;i++) {
            P[k] = mod((d*P[k] + ((int)pattern.charAt(i) - 97)), qP[k]); //what has been done here
            T[k] = mod((d*T[k] + ((int)text.charAt(i) - 97)), qP[k]);
        }           
    }
    for (int s=0;s<=n-m;s++) {
        if (isEqual(P,T)) {
            out.println(s);
            counter++;
        }
        if (s < n-m) {
            for (int k=0;k<r;k++)
                T[k] = mod(d*(T[k] - ((int)text.charAt(s) - 97)*H[k]) + ((int)text.charAt(s+m) - 97), qP[k]);       // what has been done here? 
        }

    }
} 

The problem is that I can't understand the two lines I have marked with comments in the code above. What is actually being done in those lines?

1 Answer


This is the Rabin-Karp string searching algorithm. Instead of comparing the pattern character by character against every position of the text, it compares the hash of the pattern with the hash of each text window, so only windows whose hash matches need any further checking.

To calculate the hash values it uses a rolling hash, which maintains a fixed-width window over the text (here the width is the length of the pattern) and updates the hash as the window moves forward one character at a time.

Input: pattern P, text T, d, prime number q

m = P.length
n = T.length
p = 0 // hash of pattern P
t = 0 // hash of text T
h = (d ^ (m-1)) % q 

// preprocessing: hashing P[1..m] and T[1..m] (first window of T)
for i = 1 to m 
    p = (d * p + P[i]) % q //(1)
    t = (d * t + T[i]) % q

// matching
for s = 0 to n-m
    if(p == t)
        if(P[1..m] == T[s+1..s+m])
            print "matched"
    // update the rolling hash
    if(s < n-m)
        t = (d * (t - T[s+1] * h) + T[s+m+1]) % q // (2)

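In the preprocessing phase, it calculates the hash of the pattern P and of the first window of the text T. The hash is built up one character at a time: (1) p = (d * p + P[i]) % q multiplies the hash accumulated so far by d and then adds the i-th character, so after m iterations p = (P[1] * d^(m-1) + P[2] * d^(m-2) + ... + P[m]) % q. In the Java code this is the line P[k] = mod(d*P[k] + ..., qP[k]), repeated once for each of the r primes qP[k].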

Example from Wikipedia:

// ASCII a = 97, b = 98, r = 114.

hash("abr") = (97 × 101^2) + (98 × 101^1) + (114 × 101^0) = 999,509
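
The arithmetic in this example can be checked with a small Java sketch. It assumes base 101 and raw character codes with no modulus, as in the Wikipedia example; the class and method names are made up for illustration. The loop is the same "multiply by the base, then add the next character" step as (1).

```java
final class PolyHashDemo {
    // Folds each character into the running hash: h = h * base + c.
    // No modulus is applied here, so this is only safe for short strings.
    static long hash(String s, long base) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * base + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abr", 101)); // prints 999509
    }
}
```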

In the matching phase, after comparing the pattern with the s-th window of the text (a character-by-character check is only needed when the hash values are equal), we update the hash so that it represents the (s+1)-th window. (2) t = (d * (t - T[s+1] * h) + T[s+m+1]) % q first subtracts the contribution of the window's first character, which is T[s+1] * h with h = d^(m-1) % q, then multiplies what is left by d so that every remaining character moves up one position, and finally adds the next character T[s+m+1]. This moves the window one character forward in constant time.

From Wikipedia:

rolling hash function just adds the values of each character in the substring. This rolling hash formula can compute the next hash value from the previous value in constant time: s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]

We can then compute the hash of the next substring, "bra", from the hash of "abr" by subtracting the number added for the first 'a' of "abr", i.e. 97 × 101^2, multiplying by the base and adding for the last 'a' of "bra", i.e. 97 × 101^0. Like so:

hash("bra") = [101 × (999,509 - (97 × 101^2))] + (97 × 101^0) = 1,011,309

Here 101 is the base, 999,509 is the old hash, 97 × 101^2 is the old 'a' and 97 × 101^0 is the new 'a'.
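
The same update can be written as a one-line Java method. This is only a sketch under the example's assumptions (base 101, window length 3, no modulus); `roll` and `highPower` are hypothetical names, with `highPower` playing the role of h = d^(m-1) in the pseudocode.

```java
final class RollingHashDemo {
    // Removes the outgoing character's contribution, shifts the rest up by
    // one position (multiply by base), then appends the incoming character.
    static long roll(long oldHash, char outgoing, char incoming, long base, long highPower) {
        return base * (oldHash - outgoing * highPower) + incoming;
    }

    public static void main(String[] args) {
        long base = 101;
        long highPower = base * base; // base^(m-1) for window length m = 3
        long abr = 999_509L;          // hash("abr") from the example
        System.out.println(roll(abr, 'a', 'a', base, highPower)); // prints 1011309
    }
}
```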

Remarks:

(int)text.charAt(s) - 97: 97 is the ASCII code of the character 'a', so this operation maps 'a' to 0, 'b' to 1, and so on up to 'z' = 25, which is why the code uses d = 26 as the base.
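
Putting the pieces together, here is a minimal single-prime sketch of the pseudocode in Java. It is not the code from the question: the question runs the same steps for r primes in parallel (the arrays P, T and H indexed by k), and its `mod` and `isEqual` helpers are not shown, so the `mod` below is an assumption about what such a helper needs to do (return a non-negative remainder, since t - T[s+1] * h can be negative). It also computes h by repeated multiplication rather than Math.pow, which would lose precision for long patterns. The input is assumed to consist of lowercase letters 'a' to 'z'.

```java
import java.util.ArrayList;
import java.util.List;

final class RabinKarpSketch {
    // Assumed helper: remainder in the range [0, q), even for negative x.
    static long mod(long x, long q) {
        long r = x % q;
        return r < 0 ? r + q : r;
    }

    // Returns every index s at which pattern occurs in text.
    static List<Integer> search(String text, String pattern, int d, int q) {
        List<Integer> matches = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return matches;

        // h = d^(m-1) % q, computed step by step to stay within long range.
        long h = 1;
        for (int i = 0; i < m - 1; i++) h = mod(h * d, q);

        // Preprocessing: hash the pattern and the first window of the text.
        long p = 0, t = 0;
        for (int i = 0; i < m; i++) {
            p = mod(d * p + (pattern.charAt(i) - 'a'), q);
            t = mod(d * t + (text.charAt(i) - 'a'), q);
        }

        for (int s = 0; s <= n - m; s++) {
            // Equal hashes can still be a collision, so verify the characters.
            if (p == t && text.regionMatches(s, pattern, 0, m)) matches.add(s);
            if (s < n - m) {
                // Drop text[s], shift the window, append text[s + m].
                t = mod(d * (t - (text.charAt(s) - 'a') * h) + (text.charAt(s + m) - 'a'), q);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(search("abracadabra", "abra", 26, 1_000_003)); // prints [0, 7]
    }
}
```

With several primes, as in the question, a window is reported only when the hashes agree for every prime, which makes a false match much less likely than with a single prime.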

Omid