
I have recently started playing around with Bloom filters, and I have a use case in which this calculator suggests using 15 different, unrelated hash functions. Hashing a string 15 times would be quite compute-intensive, so I started looking for a better solution.

Luckily, I encountered the Kirsch-Mitzenmacher optimization, which suggests the following: hash_i = hash1 + i * hash2, for 0 ≤ i ≤ k - 1, where k is the number of suggested hash functions.
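Spelled out, I read that as producing one index per "hash function", something like this (a sketch of my reading, with m being the filter size in bits):

# my reading of the formula: k indices, one per "hash function",
# reduced modulo m so each fits in the bit vector
indices = [(hash1 + i * hash2) % m for i in range(k)]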

In my specific case, I am trying to create a 0.5 MiB Bloom filter that should eventually be able to efficiently store 200,000 pieces of information. With 15 hash functions, this would result in p = 0.000042214, or about 1 in 23,689.
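(For reference, that p comes from the usual approximation p ≈ (1 - e^(-k*n/m))^k:)

import math

m = 4_194_304  # 0.5 MiB in bits
n = 200_000    # items to store
k = 15         # hash functions

print((1 - math.exp(-k * n / m)) ** k)  # ~4.22e-05, roughly 1 in 23,700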

I tried implementing the optimization in Python as follows:

import uuid
import hashlib
import mmh3

m = 4194304 # 0.5 MiB in bits
k = 15

bit_vector = [0] * m

data = str(uuid.uuid4()).encode()

hash1 = int(hashlib.sha256(data).hexdigest(), base=16)  # 256-bit integer
hash2 = mmh3.hash(data)  # signed 32-bit integer

for _ in range(0, k-1):
    hash1 += hash2

bit = hash1 % m

bit_vector[bit] = 1

But it seems to get far worse results than a classic Bloom filter built with only the two hashes (it performs 10x worse in terms of false positives).

What am I doing wrong here? Am I completely misinterpreting the optimization, and hence using the formula in the wrong way, or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation; maybe I should be using the first 32 bits of one and the last 32 bits of the other)? Or maybe the problem is me encoding the data before hashing it?
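(For example, as far as I can tell, mmh3 can also return the two 64-bit halves of a single 128-bit MurmurHash3 digest, so both base hashes could come from one hashing pass:)

import mmh3

# the two 64-bit halves of one 128-bit MurmurHash3 digest,
# which could serve as hash1 and hash2
# (values are signed; a later % m keeps the indices valid)
hash1, hash2 = mmh3.hash64(data)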

I also found this Git repository that discusses Bloom filters and the optimization, and their results when using it are quite good, both in computing time and in the number of false positives.

In the example I am using Python, as that's the programming language I'm most comfortable with for testing things out, but in my project I am actually going to use JavaScript to perform the whole process. Help in any programming language of your choice is much appreciated regardless.

2 Answers


The Kirsch-Mitzenmacher optimization comes from a proof-of-concept paper, and as such it assumes a bunch of requirements on the table size and the hash functions themselves. Using it naively has tripped up people before you, too. There is a bunch of practical detail to consider. You can check out section "6.5.1 Double hashing" of P. C. Dillinger's thesis, linked in his long comment as well, in which he explains the issues and offers solutions. His solutions to the RocksDB implementation issues can also be interesting.

Maybe start by trying the "enhanced double hashing":

hashes = []
for i in range(k):            # k indices, i = 0 .. k-1 (range(0, k-1) would only yield k-1)
    hash2 += i                # accumulates the extra (i^3 - i)/6 term of enhanced double hashing
    hashes.append(hash1 % m)  # reduce to a valid position in the bit vector
    hash1 += hash2
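Then set (or, when querying, test) one bit per derived index:

for idx in hashes:
    bit_vector[idx] = 1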

(I don't know why your example code only generates one hash as its result; I assume you pasted simplified code. 15 hash functions should set 15 bits.)

If this doesn't work, and you are exhausted from reading/trying more solutions, say "it's not worth it" at this point and choose to find a library with a good implementation instead, then compare the candidates for speed/accuracy. Unfortunately, I don't know of any to recommend. Maybe the RocksDB implementation, since we know an expert has worked on it?

And even that may not be worth the effort; calling mmh3.hash(data, i) with 15 different seeds may already be reasonably fast.
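That is, roughly:

for seed in range(k):
    # Python's % keeps the index non-negative even though mmh3.hash returns a signed value
    bit_vector[mmh3.hash(data, seed) % m] = 1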


You are indeed misinterpreting the Kirsch-Mitzenmacher optimization. You are combining the hash1 and hash2 functions, plus all your k functions, into what amounts to a single, more complicated hash function.

The point of the optimization is to use each of these as separate hash functions. You don't merge the k results together in any way; you set each of them separately in your bit vector. In other words, you should have something roughly along the lines of:

bit_vector[hash1 % m] = 1
bit_vector[hash2 % m] = 1

for i in range(0, k-1):
    bit_vector[(hash1 + i * hash2) % m] = 1

See how I'm setting a bit for every hash function? You were just combining them all. That ruins the whole point of using multiple different "independent" hash functions, which is really necessary for getting good results from a Bloom filter.

nsheff
  • Is the code you provided correct? I was checking the original paper, Bloom Filters in Probabilistic Verification, and the algorithm is slightly different. It does not set the bit pointed to by the second hash; instead it is started to be composed for the second index. For the first index (0), the formula outputs hash1 anyway. In your example there are k+1 bits set per addition. – Harry Sep 24 '22 at 20:24
  • I don't understand what you mean by 'instead it is started to be composed for the second index' – nsheff Oct 02 '22 at 14:56
  • Your code sets `k + 2` bits, where `k` is the number of hash functions. [This](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.2422&rep=rep1&type=pdf) paper sets exactly `k` bits: `f[0] = hash1; for i in range(1, k): hash1 = (hash1 + hash2) % n; f[i] = hash1`. Array `f` contains the indexes of the bits to set in `bit_vector`. – Harry Oct 03 '22 at 18:52