
Currently working on a string search program for a uni module, and I've implemented the algorithms successfully, at least to the point where they find the string consistently. I implemented Boyer-Moore and Rabin-Karp. I also threw in a brute force search when one of my coursemates ran into this exact problem, and realised I had the same issue: brute force is faster than Rabin-Karp on a wordlist.

Rabin-Karp seems to spend most of its time performing the rolling hash. At first I wondered whether I was just getting a lot of collisions, but I managed to get collisions down to 3 with a huge prime number. That added a little time, I'm assuming due to the size of the prime, but it seemed pretty clear that the rolling hash was causing the problem.

This is the rolling hash section:

    // hashes don't match, rehash using rolling hash to move on to next string section
    if (counter < (stringLength - patternLength)) {

        stringHash = (MAXCHAR * (stringHash - stringFile[counter] * hash)
                      + stringFile[counter + patternLength]) % prime;

        if (stringHash < 0) {
            stringHash += prime;    // when hash value is negative, make it positive
        }
    }

    if (!found) {
        counter++;
    }
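
For reference, hash in the snippet is the precomputed high-order factor MAXCHAR^(patternLength - 1) % prime, set up once before the search loop, roughly like this:

    // computed once, before the search loop (not per window)
    long hash = 1;

    for (int i = 0; i < patternLength - 1; i++) {
        hash = (hash * MAXCHAR) % prime;    // hash = MAXCHAR^(patternLength - 1) % prime
    }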

I wanted to try searching through a huge text file, so I used the rockyou wordlist. Boyer-Moore is very happy with it, and Rabin-Karp takes under a second. Brute force takes less than half the time of Rabin-Karp, though, which just doesn't make sense to me.

Am I misunderstanding how these algorithms should be applied, or is there a problem with the rolling hash process that I'm using?

  • Order notation omits the constant. If the constant is high enough, an O(1) algorithm will be slower than an O(n) one with a low constant until n becomes large enough. Make sure your implementation is correct and you are testing with a large enough dataset. – Richard Critten Dec 01 '19 at 14:33

1 Answer


Brute force string search is the special case of Rabin-Karp with a constant hash function (so every rolling hash matches).
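
To see why, here is a minimal brute-force sketch (illustrative names, not your code): it is exactly the Rabin-Karp verification step with the hash test removed, since a constant hash "matches" at every alignment.

    // Brute force: every alignment "passes" the (constant) hash test,
    // so we always fall through to the direct character comparison.
    int bruteForceSearch(const char *text, int n, const char *pat, int m) {
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && text[i + j] == pat[j])
                j++;                // advance while characters agree
            if (j == m)
                return i;           // full match at position i
        }
        return -1;                  // no occurrence
    }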

The worst-case complexity is the same for both algorithms (O(nm) for a text of length n and a pattern of length m), as is the average-case complexity for most definitions of "average case".

In these situations, Rabin-Karp will take longer due to the overhead of computing and checking a good hash.

The problem with brute force, compared to Rabin-Karp, is that bad cases sometimes occur in real life. If you're searching for pathnames, for example, then it can happen that your pattern has a long prefix in common with many or most of the path names and parts of path names in the file, and that will make brute force take a long time.
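
As a sketch, you can construct such a bad case yourself (sizes and names are just for illustration):

    #include <stdlib.h>
    #include <string.h>

    // Build a text where nearly every window shares a long prefix with the
    // pattern "aaaaaaaaja", so brute force's inner loop runs ~m steps at
    // almost every alignment. Assumes n >= 2.
    char *makeBadText(size_t n) {
        char *text = malloc(n + 1);
        memset(text, 'a', n);           // "aaaa...a"
        memcpy(text + n - 2, "ja", 3);  // ...ending in "ja" (3 bytes copies the NUL)
        return text;
    }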

With Rabin-Karp, the bad cases are very unlikely to occur in real life. They really only happen under "adversarial" conditions, where the file and pattern are constructed purposefully to take a long time, with specific knowledge of the hash function you use.

Even so... Rabin-Karp is not a great algorithm for single-pattern searching. It becomes much more useful when you're searching for many strings simultaneously and you can look up the rolling hash in a dictionary of potential matches.
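
A minimal sketch of that multi-pattern idea, assuming all patterns share one length m (the linear scan over pattern hashes stands in for a real hash-set lookup, and all names here are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define MAXCHAR 256
    #define PRIME 1000000007LL

    /* Polynomial hash of the first m characters of s. */
    static long long hashOf(const char *s, int m) {
        long long h = 0;
        for (int i = 0; i < m; i++)
            h = (h * MAXCHAR + (unsigned char)s[i]) % PRIME;
        return h;
    }

    /* Report every occurrence of any of the k patterns (all of length m). */
    static void multiSearch(const char *text, const char **pats, int k, int m) {
        int n = (int)strlen(text);
        if (m == 0 || m > n) return;

        long long patHash[k];                       /* C99 VLA, fine for a sketch */
        for (int p = 0; p < k; p++)
            patHash[p] = hashOf(pats[p], m);

        long long high = 1;                         /* MAXCHAR^(m-1) % PRIME */
        for (int i = 0; i < m - 1; i++)
            high = (high * MAXCHAR) % PRIME;

        long long h = hashOf(text, m);
        for (int i = 0; i + m <= n; i++) {
            for (int p = 0; p < k; p++)             /* "dictionary" lookup */
                if (h == patHash[p] && memcmp(text + i, pats[p], m) == 0)
                    printf("pattern %d at index %d\n", p, i);
            if (i + m < n) {                        /* roll the window by one */
                h = (h - (unsigned char)text[i] * high % PRIME + PRIME) % PRIME;
                h = (h * MAXCHAR + (unsigned char)text[i + m]) % PRIME;
            }
        }
    }

    int main(void) {
        const char *pats[] = { "abra", "cada" };
        multiSearch("abracadabra", pats, 2, 4);     /* prints matches at 0, 4, 7 */
        return 0;
    }

The point is that one rolling hash per window is amortised across all k patterns, instead of running k separate single-pattern searches.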

Matt Timmermans
  • Hi Matt, thanks for the answer. This made me go and try a different dataset to make brute force perform badly (I used about 1 million 'a' characters ending with 'ja', and had the pattern search be for 'aaaaaaaaja'), and yup, Rabin-Karp is a bit better than brute force, exactly like you said with the common prefix. Boyer-Moore is worst on this dataset; am I right in thinking this type of search is Boyer-Moore's worst case? – SweenMachine Dec 01 '19 at 15:16
  • The only really bad cases for Boyer-Moore are when the pattern appears many times. Your case is not too bad, but it defeats the optimizations, so it can't skip ahead. – Matt Timmermans Dec 01 '19 at 15:27