0

Assume the two different strings have the same length.

I am implementing the Rabin-Karp algorithm and using the hash function below:

prime = 101  # global base used by hs()

def hs(pat):
    l = len(pat)
    pathash = 0
    for x in range(l):
        pathash += ord(pat[x]) * prime**x  # character x is weighted by prime**x
    return pathash
Alex
  • What you are trying to do is a collision attack, and it requires a lot of computation power, so you might want to buy a really powerful computer for this task – Arpit Solanki Aug 13 '17 at 14:20
  • 4
    @ArpitSolanki wut? no. – Marcus Müller Aug 13 '17 at 14:20
  • Yes it does. He has to compute hashes for millions of strings to verify that the function does not give the same hash for any pair of different strings – Arpit Solanki Aug 13 '17 at 14:21
  • 1
    @ArpitSolanki no, you don't. That is not a suitable hashing algorithm above. We can quickly decompose the number you're getting into powers of 101; there's no irreversible operation here. (assuming `ord(c)` < 101) – Marcus Müller Aug 13 '17 at 14:24
  • @ArpitSolanki For now I just want to ensure the correctness of the function, though collisions must be avoided. – nirav bharadiya Aug 13 '17 at 14:24
  • 1
    @niravbharadiya collision **cannot** be avoided. that's the effect of the fact that your hash is shorter than your string. Some strings must have the same hash, otherwise it's not a hash. – Marcus Müller Aug 13 '17 at 14:25
  • 1
    You might be able to avoid collisions if you restrict the characters in the input (limit it to ASCII for example). Without any restriction like this, it's easy to find collisions: `hs('f\1') == hs('\1\2')` for example. – Aran-Fey Aug 13 '17 at 14:32
  • FWIW, your function is reversed relative to the one in the Wikipedia article on [Rabin–Karp](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm#Hash_function_used), i.e. with `prime=101`, `hs("rba")` returns 999509. You can make it more efficient, and get the same order as Wikipedia, by changing the core computation to `pathash = pathash * prime + ord(b)`. Also, if you make `prime` local to the function (eg a default parameter, or just a hard-coded constant) it will be faster than using a global. – PM 2Ring Aug 13 '17 at 15:13
  • (cont) If you're doing this in Python 3 you should probably be using byte strings, not text strings. And that also lets you eliminate the `ord()` function call. – PM 2Ring Aug 13 '17 at 15:15
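
The rewrite suggested in the last two comments could look roughly like the sketch below. The name `hs_horner`, the `prime` default parameter, and the choice to accept `bytes` are illustrative assumptions, not something taken from the question.

```python
def hs_horner(pat: bytes, prime: int = 101) -> int:
    """Polynomial hash in Horner form; the first byte gets the highest power.

    `hs_horner` and its signature are hypothetical, chosen for illustration.
    """
    pathash = 0
    for b in pat:  # iterating over bytes yields ints, so ord() is not needed
        pathash = pathash * prime + b
    return pathash


print(hs_horner(b"abr"))
```

Under these assumptions `hs_horner(b"abr")` gives 999509, the value the comment reports for the original `hs("rba")`, because the two functions weight the characters in opposite order.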

1 Answer

3

It's a hash. There's, by definition, no guarantee there will be no collisions - otherwise, the hash would have to be as long as the hashed value, at least.
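
As a concrete case, the pair of strings given in the comments under the question collides even though both strings have the same length. The snippet below is a small check of that claim; `PRIME` is just a local stand-in for the question's global `prime`.

```python
PRIME = 101  # stand-in for the question's global `prime`


def hs(pat: str) -> int:
    # Same computation as the question's function: character i is weighted by PRIME**i
    return sum(ord(c) * PRIME**i for i, c in enumerate(pat))


a, b = "f\x01", "\x01\x02"
# Two different strings of equal length that hash to the same value
print(a != b, hs(a), hs(b))
```

The collision is possible here because `ord("f")` is 102, which is not below 101.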

The idea behind what you're doing is based on number theory: powers of a number that is coprime to the size of your finite group (which the original author probably meant to be something like 2^N) can give you any number in that finite group, and it's hard to tell which powers were involved.

Sadly, the interesting part of this hash function, namely the size-limiting modulo operation, has been left out of this code – which makes one wonder where your code comes from. As far as I can immediately see, it has little to do with Rabin-Karp.
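
To illustrate what the missing modulo step might look like, here is a minimal sketch of a bounded rolling hash used in a Rabin-Karp style search. The modulus `2**32`, the Horner ordering, and the names `bounded_hash` and `find` are all assumptions made for this example; the question does not specify any of them.

```python
BASE = 101     # the question's `prime`
MOD = 2**32    # assumed modulus; the question never specifies one


def bounded_hash(s: str) -> int:
    """Polynomial hash reduced modulo MOD after every step."""
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % MOD
    return h


def find(text: str, pat: str) -> int:
    """Return the first index of pat in text, or -1 if it does not occur."""
    m, n = len(pat), len(text)
    if m > n:
        return -1
    target = bounded_hash(pat)
    window = bounded_hash(text[:m])
    # Weight of the character that leaves the window when it slides by one
    high = pow(BASE, m - 1, MOD) if m else 0
    for i in range(n - m + 1):
        # Equal hashes only suggest a match; compare the text to rule out a collision
        if window == target and text[i:i + m] == pat:
            return i
        if i + m < n:
            window = ((window - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return -1


print(find("abracadabra", "cad"))
```

Because the hash is now bounded, different windows can share a hash value, which is why the sketch still compares the actual substring before reporting a match.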

Marcus Müller
  • 1
    This hash function is not bounded. If the individual characters of the argument are restricted so that `0 <= ord(pat[i]) < 101`, then there are indeed no collisions between strings of the same length, and this hash function is simply a base-101 encoder. – President James K. Polk Aug 13 '17 at 15:46
  • 1
    @JamesKPolk exactly. What's missing is some sort of bounding function – otherwise it's not really a hash, being trivially reversible – Marcus Müller Aug 13 '17 at 15:50
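
The reversibility mentioned in these comments can be sketched as follows. The helper `unhash` is hypothetical, and it assumes the string length is known and every code point is below 101.

```python
PRIME = 101  # stand-in for the question's global `prime`


def hs(pat: str) -> int:
    # Same computation as the question's function
    return sum(ord(c) * PRIME**i for i, c in enumerate(pat))


def unhash(h: int, length: int) -> str:
    """Invert hs() for strings whose code points are all below PRIME.

    Hypothetical helper: it reads off the base-101 digits, lowest power first.
    """
    chars = []
    for _ in range(length):
        h, digit = divmod(h, PRIME)
        chars.append(chr(digit))
    return "".join(chars)


s = "HELLO"  # every code point here is below 101
print(unhash(hs(s), len(s)) == s)
```

For input such as `"abr"`, where `ord("r")` is 114, the digits overflow into the next position and this decoding no longer recovers the original string.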