I use a Guava Bloom filter with a small desired false positive probability (fpp), but the false positive rate I observe is much lower than requested:

    BloomFilter<Long> bloomFilter = BloomFilter.create(Funnels.longFunnel(), 1_000_000, .001);
    int c = 0;
    for (int i = 0; i < 1_000_000; i++) {
        // Could use random.nextLong() instead: 1M random longs almost never collide
        if (!bloomFilter.put(Long.valueOf(i))) {
            // No element is inserted twice, so put() returning false means a false positive
            c++;
        }
    }
    System.out.println(c);

I expect 1000 (1M * 0.001) false positives, but the result is 127. (With large random numbers the result is also close to 120, not 1000.)

=== UPDATE ===

Here are my test results:

desired  actual     actual/desired
0.3      0.12       40%
0.1      0.03       30%
0.03     0.006      20%    (guava's default fpp)
0.01     0.0017     17%
0.003    0.0004     13%
0.001    0.00012    12%
0.0003   0.00003    10%
0.0001   0.000009    9%
0.00003  0.000002    7%
0.00001  0.0000005   5%
auntyellow

2 Answers

The false positive probability is lower when there are fewer entries in the filter. Your test starts with an empty filter and counts false positives while the entries are being added, so most of the checks run against a filter that is far from full. That is not the right way to measure it.

You need to add all 1 million entries to the Bloom filter first, and only then measure the false positive probability, for example by checking whether the filter reports entries that you never added.

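    // First fill the filter to its designed capacity
    for (int i = 0; i < 1_000_000; i++) {
        bloomFilter.put(Long.valueOf(i));
    }
    // Then probe with values that were never added
    int c = 0;
    for (int i = 0; i < 1_000_000; i++) {
        // negative entries are not in the set, so every hit is a false positive
        if (bloomFilter.mightContain(Long.valueOf(-(i + 1)))) {
            c++;
        }
    }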
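
To see roughly how much the fill level matters, here is a small sketch that needs no Guava at all. It uses the textbook estimate (1 - e^(-k*i/m))^k for the false positive probability of a filter with m bits and k hash functions holding i entries, and compares the expected number of false hits while filling against the number after filling. The class and method names are made up for illustration, and the sizing of m and k uses the standard optimal formulas, which I am assuming are close to what the library does; the real implementation may differ in details.

```java
class BloomFillEstimate {
    // Textbook estimate of the false positive probability of a Bloom filter
    // with m bits and k hash functions after `inserted` distinct entries.
    static double fppAt(long inserted, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * inserted / m), k);
    }

    // Expected number of false hits when each of n distinct entries is checked
    // just before it is added: the i-th check sees a filter holding i entries.
    static double expectedFalseHitsWhileFilling(long n, long m, int k) {
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            sum += fppAt(i, m, k);
        }
        return sum;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        double p = 0.001;
        // Assumed sizing: standard optimal formulas, not taken from the library source.
        long m = (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));

        double whileFilling = expectedFalseHitsWhileFilling(n, m, k);
        double afterFilling = n * fppAt(n, m, k);
        System.out.printf("m=%d bits, k=%d hash functions%n", m, k);
        System.out.printf("expected false hits while filling: %.0f%n", whileFilling);
        System.out.printf("expected false hits after filling: %.0f%n", afterFilling);
    }
}
```

Under these assumptions the estimate while filling should come out near 120 and the estimate after filling near 1000, which is in line with the 127 reported in the question and with the 1000 the question expected.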
Thomas Mueller

The only guarantee BloomFilter provides is that the true false positive probability is at most the value you set. In some cases, the nature of the Bloom Filter data structure may have to "round" the actual FPP down.

This may just be a case where the BloomFilter has to be more accurate than you asked for, or you got lucky.

Louis Wasserman
  • I tested many cases while checking for duplicate keys in some billion-row distributed tables. – auntyellow Sep 21 '19 at 06:50
  • @auntyellow What might be "lucky" is the particular value you picked for the FPP. – Louis Wasserman Sep 21 '19 at 16:45
  • @LouisWasserman this is very unlikely. The law of large numbers contradicts your suggestion. Thomas Mueller is right. The test should be conducted only after all items have been inserted. – Ariel Oct 03 '19 at 05:01