Bloom Filters: Getting higher error rates than expected

Question

I created a Bloom filter using murmur3, blake2b, and the Kirsch-Mitzenmacher optimization, as described in the second answer to this question: Which hash functions to use in a Bloom filter

However, when I tested it, the Bloom filter consistently had a much higher error rate than I was expecting.

Here is the code I used to generate the bloom filters:

public class BloomFilter {
private BitSet filter;
private int size;
private int hfNum;
private int prime;
private double fp = 232000; //One false positive every fp items

public BloomFilter(int count) {
    size = (int)Math.ceil(Math.ceil(((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
    hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
    //size = (int)Math.ceil((hfNum * count) / Math.log(2.0));
    filter = new BitSet(size);

    System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
}

public BloomFilter extraSecure(int count) {
    return new BloomFilter(count, true);
}

private BloomFilter(int count, boolean x) {
    size = (int)Math.ceil((((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
    hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
    prime = findPrime();
    size = prime * hfNum;
    filter = new BitSet(prime * hfNum);

    System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
}

public void add(String in) {
    filter.set(getMurmur(in), true);
    filter.set(getBlake(in), true);

    if(this.hfNum > 2) {
        for(int i = 3; i <= (hfNum); i++) {
            filter.set(getHash(in, i));
        }
    }
}

public boolean check(String in) {
    if(!filter.get(getMurmur(in)) || !filter.get(getBlake(in))) {
        return false;
    }

    for(int i = 3; i <= hfNum; i++) {
        if(!filter.get(getHash(in, i))) {
            return false;
        }
    }

    return true;
}

private int getMurmur(String in) {
    int temp = murmur(in) % (size);

    if(temp < 0) {
        temp = temp * -1;
    }

    return temp;
}

private int getBlake(String in) {
    int temp = new BigInteger(blake256(in), 16).intValue() % (size);

    if(temp < 0) {
        temp = temp * -1;
    }

    return temp;
}

private int getHash(String in, int i) {
    int temp = ((getMurmur(in)) + (i * getBlake(in))) % size;
    return temp;
}

private int findPrime() {
    int temp;

    int test = size;
    while((test * hfNum) > size ) {
        temp = test - 1;
        while(!isPrime(temp)) {
            temp--;
        }
        test = temp;
    }

    if((test * hfNum) < this.size) {
        test++;
        while(!isPrime(test)) {
            test++;
        }
    }

    return test;
}

private static boolean isPrime(int num) {
    if (num < 2) return false;
    if (num == 2) return true;
    if (num % 2 == 0) return false;
    for (int i = 3; i * i <= num; i += 2)
        if (num % i == 0) return false;
    return true;
}

@Override
public String toString() {
    final StringBuilder buffer = new StringBuilder(size);
    IntStream.range(0, size).mapToObj(i -> filter.get(i) ? '1' : '0').forEach(buffer::append);
    return buffer.toString();
}

}
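As a sanity check on the sizing arithmetic in the two constructors, here is a minimal sketch of the standard Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2, together with the textbook estimate (1 - e^(-kn/m))^k of the resulting false-positive rate. The class and method names are hypothetical and not part of the code above. Two deliberate differences: the ratio m/n is computed in floating point here (the constructors use the integer division `this.size / count`), and k is rounded to the nearest integer rather than rounded up.

```java
/** Hypothetical helper for checking Bloom filter sizing; not part of the original class. */
final class BloomSizing {

    /** Bit count m for n items and a target false-positive probability p. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions k for m bits and n items (floating-point ratio). */
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Textbook estimate of the false-positive probability for given m, n, k. */
    static double expectedFpRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 4000;                // same item count as the test below
        double p = 1.0 / 232000;      // "one false positive every fp items"
        long m = optimalBits(n, p);
        int k = optimalHashes(m, n);
        System.out.printf("m=%d k=%d estimated fp=%.3e%n", m, k, expectedFpRate(m, n, k));
    }
}
```

Comparing the printed m and k with the "Initialized filter with ..." line from the constructors is a quick way to rule the sizing step in or out as the source of the discrepancy.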

Here is the code I'm using to test it:

public static void main(String[] args) throws Exception {
    int z = 0;
    int times = 10;
    while(z < times) {
        z++;
        System.out.print("\r");
        System.out.print(z);


        BloomFilter test = new BloomFilter(4000);

        SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
        for(int i = 0; i < 4000; i++) {
            test.add(blake256(Integer.toString(random.nextInt())));
        }

        int temp = 0;
        int count = 1;
        while(!test.check(blake512(Integer.toString(temp)))) {
            temp = random.nextInt();
            count++;
        }

        if(z == (times)) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count)).getBytes(), StandardOpenOption.APPEND);
        }else {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes(), StandardOpenOption.APPEND);
        }

        if(z == 1) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes());
        }

    }
}

I expect to get a value relatively close to the fp variable in the bloom filter class, but instead I frequently get half that. Anyone know what I'm doing wrong, or if this is normal?

EDIT: To show what I mean by high error rates, when I run the code on a filter initialized with count 4000 and fp 232000, this was the output in terms of how many numbers the filter had to run through before it found a false positive:

158852,354114,48563,76875,156033,82506,61294,2529,82008,32624

This was generated using the extraSecure() method for initialization, and repeated 10 times to generate these 10 numbers; all but one of them took less than 232000 generated values to find a false positive. The average of the 10 is about 105540, and that's common no matter how many times I repeat this test.

Looking at the values it found, the fact that it found a false positive after only generating 2529 numbers is a huge issue for me, considering I'm adding 4000 data points.
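One thing that may help when reading these counts: if each probe independently produces a false positive with probability p, the number of probes until the first false positive follows a geometric distribution. Its mean is 1/p, but its median is only about 0.69/p and short runs are not rare. The sketch below computes a few such probabilities. It assumes independent probes and a filter that meets its target rate exactly, and the class name is hypothetical.

```java
/** Hypothetical helper illustrating the geometric model; not part of the original code. */
final class FirstFalsePositive {

    /** Probability that the first false positive occurs within the first `trials` probes. */
    static double probWithin(long trials, double p) {
        return 1.0 - Math.pow(1.0 - p, trials);
    }

    /** Median number of probes until the first false positive (continuous approximation). */
    static double median(double p) {
        return Math.log(2) / -Math.log1p(-p);
    }

    public static void main(String[] args) {
        double p = 1.0 / 232000;
        System.out.printf("median probes: %.0f%n", median(p));
        System.out.printf("P(first hit within 2529 probes): %.4f%n", probWithin(2529, p));
        System.out.printf("P(first hit within 232000 probes): %.4f%n", probWithin(232000, p));
    }
}
```

Under this model roughly 63% of runs finish before 1/p probes, so most individual counts falling below 232000 is expected. It does not by itself explain an average that stays near half of 1/p across many repetitions; comparing the observed false-positive fraction over a large fixed number of probes would be a more direct measurement.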

score 1 · Answer 1 · answered Nov 12 '18 at 09:16

I'm afraid I don't know where the bug is, but you can simplify a lot. You don't actually need a prime size, and you don't need SecureRandom, BigInteger, or modulo. All you need is a good 64-bit hash (seeded if possible, for example Murmur):

long bits = (long) (entryCount * bitsPerKey);
int arraySize = (int) ((bits + 63) / 64);
long[] data = new long[arraySize];
int k = getBestK(bitsPerKey);

void add(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        data[reduce(a, arraySize)] |= 1L << a;
        a += b;
    }
}

boolean mayContain(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        if ((data[reduce(a, arraySize)] & 1L << a) == 0) {
            return false;
        }
        a += b;
    }
    return true;
}

static int reduce(int hash, int n) {
    // http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
    return (int) (((hash & 0xffffffffL) * n) >>> 32);
}

static int getBestK(double bitsPerKey) {
    return Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
}
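The snippet above leaves `entryCount`, `bitsPerKey`, `seed`, and `hash64` undefined. Below is a minimal self-contained sketch of the same outline, wrapped in a hypothetical `SimpleBloomFilter` class. The `hash64` shown here is an assumption: it applies the MurmurHash3 64-bit finalizer to `key + seed` as a stand-in mixer, and any good seeded 64-bit hash could be substituted. The bit within each word is selected with `1L << a`, as in `mayContain`.

```java
/** Minimal sketch of a Bloom filter over long keys; class name and hash64 are assumptions. */
final class SimpleBloomFilter {
    private final long[] data;
    private final int arraySize;
    private final int k;
    private final long seed;

    SimpleBloomFilter(int entryCount, double bitsPerKey, long seed) {
        long bits = (long) (entryCount * bitsPerKey);
        this.arraySize = (int) Math.max(1, (bits + 63) / 64);
        this.data = new long[arraySize];
        this.k = Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
        this.seed = seed;
    }

    void add(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            // reduce() maps a to a word; the shift uses only the low 6 bits of a.
            data[reduce(a, arraySize)] |= 1L << a;
            a += b;
        }
    }

    boolean mayContain(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            if ((data[reduce(a, arraySize)] & 1L << a) == 0) {
                return false;
            }
            a += b;
        }
        return true;
    }

    /** Maps an unsigned 32-bit value to [0, n) without a modulo. */
    static int reduce(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    /** Assumed stand-in mixer: MurmurHash3's 64-bit finalizer applied to key + seed. */
    static long hash64(long x, long seed) {
        x += seed;
        x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
        x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return x ^ (x >>> 33);
    }
}
```

To use it with strings, as in the question, the string would first have to be hashed to a `long` by some 64-bit string hash; that step is not shown here.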

Lev Knoblock · Answer 2 · 2018-03-19T15:14:52.170

Turns out the issue was that the answer on the other page wasn't completely correct, and neither was the comment below it.

The comment said:

in the paper hash_i = hash1 + i x hash2 % p, where p is a prime, hash1 and hash2 is within range of [0, p-1], and the bitset consists k * p bits.

However, the paper shows that while every hash is taken mod p, each hash function is assigned its own subset of the total bitset. I understood this to mean that hash1 mod p selects a position among indices 0 through p-1, hash2 mod p selects a position among indices p through 2*p-1, and so on, until all k hash functions chosen for the bitset have been used.
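A partitioned layout of this kind can be sketched as follows. This is one reading of the scheme, not a verified transcription of the paper: the i-th derived hash is (hash1 + i * hash2) mod p, and it is offset by i * p so that it always lands in its own block of p bits within a bitset of k * p bits. The class and method names are hypothetical, and hash1 and hash2 are taken as plain ints from whatever two base hashes are in use.

```java
import java.util.BitSet;

/** Sketch of a partitioned double-hashing layout; names are hypothetical. */
final class PartitionedBloomFilter {
    private final BitSet bits;
    private final int p;   // partition size (a prime, per the quoted comment)
    private final int k;   // number of partitions, one per derived hash function

    PartitionedBloomFilter(int p, int k) {
        this.p = p;
        this.k = k;
        this.bits = new BitSet(p * k);
    }

    /** Index of the i-th probe (0 <= i < k); always inside [i*p, (i+1)*p). */
    int index(int h1, int h2, int i) {
        // floorMod keeps the result in [0, p-1] even when h1 or h2 is negative.
        long g = Math.floorMod((long) h1 + (long) i * h2, (long) p);
        return i * p + (int) g;
    }

    void add(int h1, int h2) {
        for (int i = 0; i < k; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    boolean mightContain(int h1, int h2) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Using `Math.floorMod` instead of `%` followed by a sign flip avoids negative indices without folding two different remainders onto the same slot.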

I'm not 100% sure if adding this will fix my code, but it's worth a try. I'll update this if it works.

UPDATE: Didn't help. I'm looking into what else may be causing this problem.
