I recently started playing around with Bloom filters, and I have a use case for which this calculator suggests using 15 different, unrelated hash functions. Hashing a string 15 times would be quite compute-intensive, so I started looking for a better solution.
Luckily, I came across the Kirsch-Mitzenmacher optimization, which suggests computing hash_i = hash1 + i * hash2 for 0 ≤ i ≤ k - 1, where k is the number of hash functions.
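To make sure I understand it: the formula derives all k bit indices from just two base hashes. Something like this (my own sketch, with made-up hash values):

```python
def km_indices(h1: int, h2: int, k: int, m: int) -> list[int]:
    # Kirsch-Mitzenmacher: k indices from two base hashes,
    # where m is the size of the bit vector
    return [(h1 + i * h2) % m for i in range(k)]

idx = km_indices(12345, 67890, 15, 4194304)
# 15 indices; the i = 0 index is hash1 alone
```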
In my specific case, I am trying to create a 0.5 MiB Bloom filter that should eventually be able to store 200,000 pieces of information efficiently. With 15 hash functions, this would result in p = 0.000042214, or about 1 in 23,689.
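For reference, the 0.000042214 figure is just the standard Bloom filter false-positive estimate p = (1 - e^(-kn/m))^k plugged with my parameters:

```python
import math

m = 4_194_304   # bits (0.5 MiB)
n = 200_000     # expected items
k = 15          # hash functions

# Standard false-positive estimate for a Bloom filter
p = (1 - math.exp(-k * n / m)) ** k
print(p)  # ≈ 4.22e-5, i.e. roughly 1 in 23,700
```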
I tried implementing the optimization in Python as follows:
import uuid
import hashlib
import mmh3

m = 4194304  # 0.5 MiB in bits
k = 15
bit_vector = [0] * m

data = str(uuid.uuid4()).encode()
hash1 = int(hashlib.sha256(data).hexdigest(), base=16)
hash2 = mmh3.hash(data)

for _ in range(0, k - 1):
    hash1 += hash2
    bit = hash1 % m
    bit_vector[bit] = 1
But it seems to give far worse results than a classic Bloom filter built with only two hashes: roughly 10x worse in terms of false positives.
What am I doing wrong here? Am I completely misinterpreting the optimization, and therefore applying the formula the wrong way? Or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation; maybe I should be using the first 32 bits of one and the last 32 bits of the other)? Or could the problem be my encoding of the data before hashing it?
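For clarity, by using "parts" of the hashes I mean something like splitting a single digest into two independent halves. This is a sketch of what I have in mind, not necessarily the right fix:

```python
import hashlib

def two_hashes(data: bytes) -> tuple[int, int]:
    # One SHA-256 digest, split into two independent 64-bit halves
    digest = hashlib.sha256(data).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return h1, h2

h1, h2 = two_hashes(b"example")
```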
I also found this git repository that discusses Bloom filters and the optimization, and their results with it are quite good, both in computing time and in the number of false positives.
*In the example I am using Python, as that is the language I am most comfortable with for testing things out, but in my project I will actually use JavaScript for the whole process. Help in any programming language of your choice is much appreciated regardless.