
For my implementation of the minhashing algorithm I need to make many random permutations of integers, which will be simulated by using random hash functions (as many as possible). Currently I use hash functions of the form:

h(x) = (a*x + b) % c

where a and b are randomly generated numbers, and c is a prime number bigger than the highest value of b. However, the code runs far too slowly, and I cannot use more than about 15 such hash functions in a reasonable running time. Can anyone recommend other ways of using random hash functions for integers in Python? In other posts I came across suggestions for using bitwise shuffling and an XOR operation, but I didn't fully understand how to implement something like this (I'm relatively new to Python).

Keyb0ardwarri0r
  • Show your code. Can't help you if we don't know how you implemented the solution you're dissatisfied with. Alternatively, if you're just asking for suggestions for off-site libraries or resources, that's explicitly off-topic for StackOverflow. – pjs Oct 17 '16 at 19:43
  • To make the code you have much faster, fix c at a power of two and ensure that a is always odd. This ensures that a and c are co-prime (maximizing the number of possible unique results) and that the modulo operation can be done efficiently with boolean arithmetic. – sh1 Oct 17 '16 at 20:55

1 Answer


Borrowing from my answer to a similar question, and having a quick look at Python documentation to try to guess valid syntax...

The formula you posted is OK, but the intermediate product is probably being computed in more precision than is needed, and the modulo involves a division, which also makes things slow.

To make it faster, you can fix c at a power of two and use bitwise & (AND) instead of modulo, which gives you this:

h(x) = (a * x + b) & ((1 << 32) - 1)

which is the same as:

h(x) = (a * x + b) & (4294967296 - 1)

which is the same as:

h(x) = (a * x + b) % 4294967296

and you must ensure that a is an odd number (this is all that's needed to make it co-prime with c when c is a power of two). This example limits the output range to a 32-bit integer. You can change that as you see fit. Python integers are arbitrary precision, so the mask is the only thing that keeps the values bounded.
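As a rough sketch of how the masked form could be used to build many hash functions at once: the helper name make_hash, the fixed seed, and the count of 100 are all assumptions made for illustration, not part of the question's code.

```python
import random

MASK32 = (1 << 32) - 1  # c = 2**32, so "% c" becomes "& MASK32"

def make_hash(rng):
    """Return one randomly parameterised h(x) = (a*x + b) mod 2**32."""
    a = rng.getrandbits(32) | 1  # set the low bit so that a is odd
    b = rng.getrandbits(32)
    def h(x):
        return (a * x + b) & MASK32
    return h

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
hashes = [make_hash(rng) for _ in range(100)]

# In Python the mask and the modulo agree, including for negative values.
for x in (0, 1, 12345, 2**40 + 7, -5):
    assert (3 * x + 11) & MASK32 == (3 * x + 11) % (1 << 32)
```

Because a is odd, each h maps distinct 32-bit inputs to distinct outputs, which is what lets it stand in for a permutation.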

If you want more parameterisation than that, or you discover that the results aren't "random" enough (it would fail statistical tests very quickly, but that usually doesn't matter), then you can add more operations. Adding more multiplies and adds won't help, though, because a chain of adds and multiplies always simplifies to a single multiply and a single add, so the extra operations wouldn't fix anything.
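To see why chaining doesn't help, here is a small check that two multiply-add steps equal one multiply-add step with combined constants. The constants and the helper name affine are arbitrary choices for this illustration.

```python
MASK32 = (1 << 32) - 1

def affine(a, b):
    """Return the function x -> (a*x + b) mod 2**32."""
    return lambda x: (a * x + b) & MASK32

a1, b1 = 0x9E3779B1, 12345  # arbitrary odd multiplier and offset
a2, b2 = 0x85EBCA6B, 67890

first = affine(a1, b1)
second = affine(a2, b2)

def two_steps(x):
    return second(first(x))

# a2*(a1*x + b1) + b2 == (a2*a1)*x + (a2*b1 + b2), all taken mod 2**32
one_step = affine((a2 * a1) & MASK32, (a2 * b1 + b2) & MASK32)

assert all(two_steps(x) == one_step(x) for x in range(10000))
```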

What you can do instead is to use bit shifts and exclusive-or to break up the linearity; like so:

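MASK32 = (1 << 32) - 1  # keep every intermediate result within 32 bits

def h(x, a, b, c, d):
  # a and c are odd multipliers (c here is not the modulus used earlier);
  # b and d are arbitrary offsets
  x = x ^ (x >> 16)          # fold the high bits into the low bits
  x = (a * x + b) & MASK32
  x = x ^ (x >> 16)
  x = (c * x + d) & MASK32
  x = x ^ (x >> 16)
  return x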

You can experiment with variations on that if you want. If you set b and d to zero and change the middle 16 to 13 then you get the MurmurHash3 finaliser construction, which is near enough to ideal for most purposes provided you pick good a and c (sadly they can't just be random).
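For reference, a sketch of that finaliser in Python, using the multiplier constants published in the MurmurHash3 reference code. The per-function variation shown here (XORing the input with a random seed before mixing) and the names minhash_signature and seeds are assumptions of this sketch, not something the construction prescribes, and inputs are assumed to be non-negative integers.

```python
import random

MASK32 = (1 << 32) - 1

def fmix32(x):
    """MurmurHash3 32-bit finaliser (b = d = 0, middle shift 13)."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK32
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK32
    x ^= x >> 16
    return x

def minhash_signature(items, seeds):
    """One minimum per seed; the XOR-with-seed step is an assumption."""
    return [min(fmix32(x ^ s) for x in items) for s in seeds]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
seeds = [rng.getrandbits(32) for _ in range(64)]

sig_a = minhash_signature({1, 2, 3, 4, 5, 6, 7, 8}, seeds)
sig_b = minhash_signature({1, 2, 3, 4, 5, 6, 7, 9}, seeds)

# Fraction of matching positions estimates the Jaccard similarity.
estimate = sum(p == q for p, q in zip(sig_a, sig_b)) / len(seeds)
```

Whether seeding by XOR gives functions that are independent enough for your data is something to check empirically.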

sh1