3

I am trying to implement the Rabin-Karp for finding the substring; and I got stuck at the rolling hash(trying to use the formula suggested in Wikipedia).

#define MOD 1000000007
unsigned long long rolling_hash(const char *str)
{
        unsigned long long hash = 0;
        size_t str_len = strlen(str);
        for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                hash = hash + str[i] * pow(257, k);
        //      hash = hash % MOD;
        }
        return hash;
}

int main(void)
{
        printf("%llu\n", rolling_hash("TestString"));
        printf("%llu\n", rolling_hash("estStringh"));
        unsigned long long old = rolling_hash("TestString");
        // Add a character to the end
        // since the last char in old was multiplied by 1, now multiply it by
        // the base and then add the _new_ character to the end
        old = old * 257 + 'h';
        //old = old % MOD;
        // Remove a char from the start
        // Simply, remove the hash value of the first character
        old = old - 'T' * pow(257, 10);;

        printf("\n%llu\n", old);
        return 0;
}

The code above works perfectly fine as long as I do not introduce any remainder operations; once I uncomment my % operations, things break down and the answer I get from the changes over the rolling hash won't equal that which's being printed by the second print.

janisz's answer:
The suggestion to change the hash generator as in janisz's answer got the remainder to work when adding new characters but NOT when removing the old ones.
Note: I am using my own pow function to work with unsigned long long

Fingolfin
  • 5,363
  • 6
  • 45
  • 66
  • What is wrong with the commented line? `%` really means modulus. – Palec Dec 05 '13 at 22:57
  • @Palec: Please check the edited question. (Side note: % really means remainder, not modulus) (http://stackoverflow.com/questions/9284644/why-is-operator-referred-to-as-the-modulus-operator-instead-of-the-remainder) – Fingolfin Dec 06 '13 at 09:56

1 Answers1

2

Hash genrator code is wrong. It should be

hash = (hash*257 + str[i]) % MOD;

and unncoment old_hash = old_hash % MOD;. Also change the way you generate new hash from previous

(old_hash - to_delete_char * pow(257, str_len-1)) % MOD;

Take a look at your code. First 2 lines are perfectly good. What happen in the loop. First of all you are doing as much multiplies as you can. In my approach I use Horner scheme of computing hash becouse hash is a polynomial.

Why it works when without modulus and with not. I think it's a coincidence becouse you overflow integer with 8 characters (log(2^64)/log(257) = 8).

Now what's wrong with removing characters. to_delete_char * pow(257, str_len); should be to_delete_char * pow(257, str_len-1); index should start from 0 not 1 to mach your generator.

EDIT: I think problem was in pow function. As I wrote above it overflow just with 8 characters. In your example you have 10 so it can't work.

EDIT: It turns out that adding and removing character must be done as a one operation. Probably due to equivalents but I'm not sure.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MOD 787

unsigned long long pow(int x, int y)
{
    unsigned long long ret = 1;
    for (int i=0;i<y;i++)
        ret = (ret*x)%MOD;
    return ret;
}
unsigned long long rolling_hash(const char *str)
{
        unsigned long long hash = 0;
        size_t str_len = strlen(str);
        for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                hash = hash + (str[i] * pow(257, k))%MOD;
                hash = hash % MOD;
        }
        return hash;
}

int main(void)
{
        char input[] = "TestString";
        printf("Input: %llu\n", rolling_hash(input));
        printf("Expected: %llu\n", rolling_hash("estStringh"));
        unsigned long long old = rolling_hash(input);
        // Add a character to the end
        // and Remove a char from the start

        unsigned long long  h = (input[0] * pow(257, strlen(input)))%MOD;
        old = ((old * 257) + 'h' - h) % MOD;

        printf("Actual: %llu\n", old);
        return 0;
}
janisz
  • 6,292
  • 4
  • 37
  • 70
  • Fixing the hash generator and uncommenting the line of the mod fixed adding new characters to the string; but can you explain what was wrong with the previous code? As for deleting the character, that's not working. – Fingolfin Dec 05 '13 at 23:14
  • @AdelQodmani try deleting now. Problem was in `pow(257, str_len)`. I'll explain it in answer – janisz Dec 05 '13 at 23:21
  • 1
    You mean Horner’s, not Heron method. Also called Horner scheme. – Palec Dec 05 '13 at 23:47
  • @AdelQodmani is it broken for any input? Can you post more code. I think it could be related to overflow issue – janisz Dec 06 '13 at 09:21
  • @janisz I edited the question a bit to reflect what exactly I was doing; so far what you've suggested didn't work for deleting characters. Another thing, isn't doing as many multiplications as I can a good thing? Yes, slow; but doesn't it lower the chance of collisions? or I got something wrong? And yep, It's broken for any input. – Fingolfin Dec 06 '13 at 09:21
  • @janisz: The example as you edited works indeed; by let's pick a slightly different string: "HestString" instead of "TestString" and try to add 'h' to the end and remove 'H' from the beginning, it breaks down again. – Fingolfin Dec 06 '13 at 09:48
  • Yes it also breaks for different MOD. – janisz Dec 06 '13 at 09:53
  • @AdelQodmani Now it's look like it's working for any MOD and input. I do only small tests. I think problem was because you split generating new hash in to 2 lines and calculating modulus break modular arithmetic rules. – janisz Dec 06 '13 at 10:37
  • @janisz That sounds like the problem; but I am not sure which operation/s broke the modulus arithmetic. – Fingolfin Dec 06 '13 at 11:49