Selecting parameters for string hashing

Question

I was recently reading an article on string hashing. We can hash a string by converting a string into a polynomial.

H(s1 s2 s3 ... sn) = (s1 + s2*p + s3*p^2 + ... + sn*p^(n−1)) mod M.

What are the constraints on p and M so that the probability of collision decreases?

A good requirement for a hash function on strings is that it should be difficult to find a pair of different strings, preferably of the same length n, that have equal fingerprints. This excludes the choice of M < n. Indeed, in this case at some point the powers of p corresponding to respective symbols of the string start to repeat.
Similarly, if gcd(M, p) > 1 then powers of p modulo M may repeat for exponents smaller than n. The safest choice is to set p as one of the generators of the group U(Z_M), the group of all integers relatively prime to M under multiplication modulo M.

I am not able to understand the above constraints. How does selecting M < n or gcd(M, p) > 1 increase collisions? Can somebody explain these two with some examples? I just need a basic understanding of these.

In addition, if anyone can comment on upper and lower bounds for M, that would be more than enough. The above facts have been taken from the following article: string hashing mit.

score 1 · Accepted Answer · answered Jun 16 '16 at 19:57

The "correct" answers to these questions involve some amount of number theory, but it can often be instructive to look at some extreme cases to see why the constraints might be useful.

For example, let's look at why we want M ≥ n. As an extreme case, let's pick M = 2 and n = 4. Then look at the numbers p⁰ mod 2, p¹ mod 2, p² mod 2, and p³ mod 2. Because there are four numbers here and only two possible remainders, by the pigeonhole principle we know that at least two of these numbers must be equal. Let's assume, for simplicity, that p⁰ and p¹ are the same. This means that the hash function will return the same hash code for any two strings whose first two characters have been swapped, since those characters are multiplied by the same amount, which isn't a desirable property of a hash function. More generally, the reason why we want M ≥ n is so that the values p⁰, p¹, ..., pⁿ⁻¹ at least have the possibility of being distinct. If M < n, there will just be too many powers of p for them to all be unique.
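A minimal Python sketch of this pigeonhole argument, assuming characters are mapped to integers with ord() and using the illustrative values p = 3, M = 2, n = 4. The helper names poly_hash and powers_mod are hypothetical and not taken from the article:

```python
def poly_hash(s, p, M):
    """H(s) = (s1 + s2*p + s3*p^2 + ...) mod M, with characters taken as ord() codes."""
    h = 0
    power = 1 % M  # p^0 mod M
    for ch in s:
        h = (h + ord(ch) * power) % M
        power = (power * p) % M
    return h


def powers_mod(p, M, n):
    """Return [p^0 mod M, p^1 mod M, ..., p^(n-1) mod M]."""
    out = []
    power = 1 % M
    for _ in range(n):
        out.append(power)
        power = (power * p) % M
    return out


p, M, n = 3, 2, 4  # assumed example values with M < n
weights = powers_mod(p, M, n)
print(weights)  # [1, 1, 1, 1]: every position gets the same multiplier

# Swapping two characters whose positions share a multiplier cannot change the hash.
print(poly_hash("abcd", p, M) == poly_hash("bacd", p, M))  # True
```

With only M possible remainders, any n > M positions must share at least one multiplier, so some swap of two characters always goes undetected.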

Now, let's think about why we want gcd(M, p) = 1. As an extreme case, suppose we pick p such that gcd(M, p) = M (that is, we pick p = M). Then

s₀p⁰ + s₁p¹ + s₂p² + ... + sₙ₋₁pⁿ⁻¹ (mod M)

= s₀M⁰ + s₁M¹ + s₂M² + ... + sₙ₋₁Mⁿ⁻¹ (mod M)

= s₀ (mod M)

Oops, that's no good - that makes our hash code exactly equal to the first character of the string. This means that if p isn't coprime with M (that is, if gcd(M, p) ≠ 1), you run the risk of certain characters being "modded out" of the hash code, increasing the collision probability.
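The effect can be checked with a small Python sketch. The values M = 16, p = 16 and p = 4, the integer "character codes", and the helper name poly_hash are all assumptions chosen for illustration:

```python
def poly_hash(codes, p, M):
    """Polynomial hash of a sequence of integer character codes."""
    h = 0
    power = 1 % M
    for c in codes:
        h = (h + c * power) % M
        power = (power * p) % M
    return h


M = 16

# Extreme case p = M: every power beyond p^0 is 0 mod M,
# so only the first code contributes to the hash.
print(poly_hash([5, 9, 2], 16, M))  # 5
print(poly_hash([5, 1, 7], 16, M))  # 5

# Milder case p = 4, gcd(4, 16) = 4: the multipliers reach 0 after two steps,
# so every code from the third position onward is ignored.
weights = [pow(4, i, M) for i in range(4)]
print(weights)  # [1, 4, 0, 0]
print(poly_hash([5, 9, 2, 8], 4, M) == poly_hash([5, 9, 7, 3], 4, M))  # True
```

When p shares a factor with M, the multipliers can collapse to zero (or to a small repeating set) well before position n, which is the "modded out" behaviour described above.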

That's a very beautiful explanation with nice and easy examples. Thanks! — Shivam Mitra, Jun 16 '16 at 20:04

Tony Delroy · Answer 2 · 2016-06-17T13:50:24.267

How selecting M < n and gcd(M,p) > 1 increases collision?

In your hash function formula, M might reasonably be used to restrict the hash result to a specific bit-width: e.g. M=2¹⁶ for a 16-bit hash, M=2³² for a 32-bit hash, M=2⁶⁴ for a 64-bit hash. Usually, a mod/% operation is not actually needed in an implementation, as using the desired size of unsigned integer for the hash calculation inherently performs that function.
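As a rough illustration: Python integers do not overflow, so the sketch below emulates 32-bit unsigned wraparound with a bit mask and checks that it matches an explicit mod 2³². The multiplier p = 31 and the function names are assumptions for the example only:

```python
MASK32 = 0xFFFFFFFF  # 2**32 - 1; keeps only the low 32 bits


def hash_mod(s, p=31, M=2**32):
    """Polynomial hash with an explicit modulus."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % M
        power = (power * p) % M
    return h


def hash_wrap32(s, p=31):
    """Same hash, emulating arithmetic on a 32-bit unsigned integer."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) & MASK32
        power = (power * p) & MASK32
    return h


for text in ("", "hello", "a" * 100):
    print(hash_mod(text) == hash_wrap32(text))  # True each time
```

In a language with fixed-width unsigned types (for example uint32_t in C or C++), the masking step happens implicitly on overflow, which is why no explicit % is needed there.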

I don't recommend it, but sometimes you do see people describing hash functions that are so exclusively coupled to the size of a specific hash table that they mod the results directly to the table size.

The text you quote from says:

A good requirement for a hash function on strings is that it should be difficult to find a pair of different strings, preferably of the same length n, that have equal fingerprints. This excludes the choice of M < n.

This seems a little silly in three separate regards. Firstly, it implies that hashing a long passage of text requires a massively long hash value, when practically it's the number of distinct passages of text you need to hash that's best considered when selecting M.

More specifically, if you have V distinct values to hash with a good general purpose hash function, you'll get dramatically fewer collisions of the hash values if your hash function produces at least V² distinct hash values. For example, if you are hashing 1000 values (~2¹⁰), you want M to be at least 1 million (i.e. hash values of at least 2 × 10 = 20 bits, which is fine to round up to 32-bit, but ideally don't settle for 16-bit). Read up on the Birthday Problem for related insights.
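A small Python sketch of the birthday calculation behind this rule of thumb, assuming an idealized hash that assigns each of the V values an independent, uniformly random result in [0, M). The function name collision_probability is hypothetical:

```python
def collision_probability(V, M):
    """Chance that at least two of V uniformly random hashes in [0, M) coincide."""
    if V > M:
        return 1.0  # pigeonhole: a collision is guaranteed
    p_all_distinct = 1.0
    for i in range(V):
        # The (i+1)-th value must avoid the i hash results already taken.
        p_all_distinct *= (M - i) / M
    return 1.0 - p_all_distinct


# Compare 16-, 20- and 32-bit hashes for V = 1000 values.
for bits in (16, 20, 32):
    print(bits, collision_probability(1000, 2 ** bits))
```

Under this model a 16-bit hash is almost certain to produce at least one collision among 1000 values, while a 32-bit hash makes one very unlikely. Real hash functions and real inputs are not perfectly uniform, so treat the numbers as a guide only.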

Secondly, given n is the number of characters, the number of potential values (i.e. distinct inputs) is the number of distinct values any specific character can take, raised to the power n. The former is likely somewhere from 26 to 256 values, depending on whether the hash supports only letters, or say alphanumeric input, or standard- vs. extended-ASCII and control characters etc., or even more for Unicode. The suggestion in "excludes the choice of M < n" that there is any relevant linear relationship between M and n is bogus; if anything, collisions become increasingly likely as M drops below the number of distinct potential input values, but again it's the actual number of distinct inputs that tends to matter much, much more.

Thirdly, "preferably of the same length n" - why's that important? As far as I can see, it's not.

I've nothing to add to templatetypedef's discussion on gcd.
