2

I wrote a simple step-by-step implementation of Rabin-Karp algorithm for substring search, and it seems to work fine until the hash becomes greater than the modulus, and then it goes wrong...

Here is the code, it's quite simple:

typedef long long ll;

#define B 257
//base
#define M 2147483647
//modulus

//modulus for positive and negative values
ll mod(ll a){
    return (a % M + M) % M;
}

//fast way to calculate modular power
ll power(ll n, ll e){
    ll r = 1;
    for(; e > 0; e >>= 1, n = (n*n) % M)
        if(e&1) r = (r * n) % M;
    return r;
}

//function to calculate de initial hash
//H(s) = s[0] * B^0 + s[1] * B^1 + ...
ll H(char sub[], int s){
    ll h = 0;
    for(ll i = 0; i < s; i++)
        h = mod(h + mod(power(B, i) * sub[i]));
    return h;
}

//brute force comparing when hashes match
bool check(char text[], char sub[], int ini, int s){
    int i = 0;
    while(text[ini + i] == sub[i] && i < s) i++;
    return i == s;
}

//all together here
void RabinKarp(char text[], char sub[]){
    int t = strlen(text), s = strlen(sub);
    ll hs = H(sub, s), ht = H(text, s);
    int lim = t - s;

    for(int i = 0; i <= lim; i++){
        if(ht == hs)
            if(check(text, sub, i, s))
                printf("MATCH AT %d\n", i);           

        ht -= text[i];      
        ht /= B;            
        ht = mod(ht + power(B, s - 1) * text[i + s]);

        //we had    text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)

        //then    text[i+1] * B^1 + text[i+2] * B^2 + ... + text[i + len - 1] * B^(len-1)
        //then    text[i+1] * B^0 + text[i+2] * B^1 + ... + text[i + len - 1] * B^(len-2)
        //finally we add a new last term text[i + len] * B^(len-1)

        //so we moved the hash to the next position
    }
}



int main(){
    char text[] = "uvauvauvaaauva";
    char sub[] = "uva";
    char sub2[] = "uvauva";
    RabinKarp(text, sub);
    printf("----------------------------\n");
    RabinKarp(text, sub2);
}

The problem is that after I take the modulus, the hash can become a small number and then, when I add some big factor to it, the hashes may not match even when they should.

For example: abc inside xabc

when I take the hash of abc and xab, suppose both of them are bigger than the modulus, so they get small after the modulus operation.

Then, when I remove 'x' and add the 'c' factor, the sum can be smaller than the modulus but still big, so it won't match.

How can I overcome this problem?

Daniel
  • 7,357
  • 7
  • 32
  • 84
  • Run tests on your functions to make sure they're working the way you intend - almost certain that there is a bug or logical error in your modulo mathematics – Nick Zuber May 10 '16 at 23:55
  • suppose my mod is 2550 and base 50. If im searching aaa, it should be 1 + 50 + 2500 = 2551 % 2550 = 1. if my string is baaa, the first hash would be 2 + 50 + 2500 = 2552 % 2550 = 2, then when I subtract 2 and then i divide for 50, it stays on 0, and when i sum 2500, my hash will be 2500 when it should be 1 – Daniel May 11 '16 at 00:30

1 Answers1

2

ht /= B; is not plausible. First of all because you are doing arithmetic mod M, and the modular equivalent of division is not the same as the standard one. Secondly because you should expect the same answer for x and x + M and this will not be the case.

You have text[i] * B^0 + text[i+1] * B^1 + ... + text[i + len - 1] * B^(len-1)

If you work with

text[i] * B^(len-1) + text[i+1] * B^(len - 2) + ... + text[i + len - 1] * B^0

You can subtract off text[i] * B^(len-1) and then multiply by B instead

mcdowella
  • 19,301
  • 2
  • 19
  • 25
  • Yeah on the part about the division not being plausible I remember about needing to do something with modular inverse instead (correct me if I'm wrong) – Nick Zuber May 11 '16 at 15:43
  • You could use the modular inverse but you don't have to. If you reverse the way the polynomial is constructed as in my answer then the effects of the character you want to get rid of are associated with the highest power of B and you can just subtract this off and then multiply what remains by B to shift everything up one place, and then add on the new character. – mcdowella May 11 '16 at 16:51
  • oh I see, interesting approach! – Nick Zuber May 11 '16 at 16:57
  • i calculated the modular inverse of 257 modulus 2147483647 and multiplied instead of divide, but it didnt work :( but with @mcdowella approach it worked fine! thanks!!! – Daniel May 12 '16 at 01:18