
Bloom filters work very well when inputs are purely random. If we know a Bloom filter's size and fpp (false positive probability), we can easily derive the number of hash functions it uses. By default, libraries use some fixed hashing algorithm; for example, Guava uses MurmurHash. But these hash functions (MurmurHash in particular) are not cryptographic hash functions. Hence, if someone knows

  1. the total number of elements and the fpp for which the Bloom filter was created
  2. some elements that already exist in the Bloom filter

it might be theoretically possible to generate other elements that were never inserted into the Bloom filter but that the filter will nevertheless report as present.
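To make the "easily derive" claim concrete, here is a minimal sketch using the standard textbook sizing formulas, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function name `bloom_parameters` is hypothetical, and the rounding choices are an assumption: a given library may round differently, so the exact values should be checked against its source.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): bit count and hash-function count for n expected
    items at target false-positive probability p (textbook formulas)."""
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("n must be positive and p must be in (0, 1)")
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2, rounded up here
    # (assumption: libraries may truncate instead).
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2, at least 1.
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = bloom_parameters(1000, 0.01)
print(m, k)
```

Anyone who knows n and the fpp can run the same arithmetic, which is why the parameters alone are not a secret worth relying on.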

What are the strategies to prevent this from happening? A few things that come to mind:

  1. Use a cryptographic hash function, as mentioned in this answer.
  2. Randomize the expected number of elements to be inserted, so that the total number of hash functions becomes hard to guess.
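As a rough illustration of the first idea, here is a minimal sketch of a Bloom filter whose bit positions come from a keyed cryptographic hash. Note that this goes one step beyond "use a cryptographic hash": it assumes a secret per-filter key (BLAKE2b in keyed mode from Python's `hashlib`), so an attacker who knows the algorithm still cannot compute positions offline. The class name `KeyedBloomFilter`, the double-hashing scheme, and the 32-byte key size are all illustrative assumptions, not how Guava or any other library does it.

```python
import hashlib
import secrets


class KeyedBloomFilter:
    """Sketch of a Bloom filter indexed by a keyed BLAKE2b digest."""

    def __init__(self, m, k, key=None):
        self.m = m  # number of bits
        self.k = k  # number of probe positions per item
        # Assumption: the key stays secret and is generated per filter.
        self.key = key if key is not None else secrets.token_bytes(32)
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, item):
        # One keyed 128-bit digest, split into two 64-bit halves.
        digest = hashlib.blake2b(item, key=self.key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force a non-zero step
        # Double hashing: position i is (h1 + i * h2) mod m.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def might_contain(self, item):
        return all(
            self.bits[idx >> 3] & (1 << (idx & 7))
            for idx in self._indexes(item)
        )


bloom = KeyedBloomFilter(m=9586, k=7)
bloom.add(b"alice")
print(bloom.might_contain(b"alice"))
```

This only helps while the key is private. If the attacker can read the bit array directly or probe the filter freely, crafted false positives may still be findable by brute force, just not by offline collision search against a public hash.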

Any other solutions?

best wishes
  • perhaps fingerprint-based filters like a cuckoo filter? There is a lot of literature on data sketches so I am sure you can find something. – Ben Manes Mar 29 '22 at 18:37
  • @BenManes The problem of full hash collisions would remain even with cuckoo filters or other fingerprint-based filters like quotient filters. – jbapple Apr 21 '22 at 14:43
  • @jbapple thanks! I haven't looked at them enough to grok their approaches. When I've used a count-min sketch, I added jitter in the algorithm to protect from hash flooding ([1](https://github.com/ben-manes/caffeine/blob/f52062955adb749e7828d1f1cff12a9eea26bee4/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L157-L165), [2](https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsUUBQfAJ2c2NsFPNPr2qymIBs/edit#slide=id.g833e19c165_1_155)). The underlying hash table also has protection by degrading to a red-black tree. Gracefully degrade but stay correct. – Ben Manes Apr 21 '22 at 15:38
  • @BenManes is there a video link for the 2nd presentation you have attached? It seems like an interesting talk. – best wishes Apr 22 '22 at 01:21
  • @bestwishes Unfortunately not. This was part of a Usenix Fast 2020 [tutorial session](https://www.usenix.org/conference/fast20/presentation/afternoontutorial1) but unlike research papers they did not make the material public. It was just prior to the pandemic lockdown and I rarely give presentations. For the cache there are a [few articles](https://github.com/ben-manes/caffeine#in-the-news) but they don't cover hash flooding since that scares people. – Ben Manes Apr 22 '22 at 01:32
  • @bestwishes here is the [full slide deck](https://drive.google.com/file/d/1oHV8SjrdhSdwgZ6jegJh8Ysrbnzxvlfk/view?usp=sharing) if of interest. – Ben Manes Apr 22 '22 at 01:35
  • Interesting question. I haven’t seen much work done in making these data structures cryptographically secure. One question that comes to mind is what data the attacker would have access to. Do they get to see the actual bits making up the Bloom filter? Can they measure the time effect of a query? – templatetypedef Sep 17 '22 at 07:45
