I created a Bloom filter using murmur3, blake2b, and the Kirsch-Mitzenmacher optimization, as described in the second answer to this question: Which hash functions to use in a Bloom filter
However, when I tested it, the Bloom filter consistently had a much higher error rate than I expected.
Here is the code I used to generate the Bloom filters:
    import java.math.BigInteger;
    import java.util.BitSet;
    import java.util.stream.IntStream;

    // murmur(String) and blake256(String) are helpers defined elsewhere (not shown).
    public class BloomFilter {
        private BitSet filter;
        private int size;   // number of bit positions in the filter
        private int hfNum;  // number of hash functions
        private int prime;
        private double fp = 232000; //One false positive every fp items

        public BloomFilter(int count) {
            size = (int)Math.ceil(Math.ceil(((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
            // Note: this.size / count is integer division.
            hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
            //size = (int)Math.ceil((hfNum * count) / Math.log(2.0));
            filter = new BitSet(size);
            System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
        }

        public BloomFilter extraSecure(int count) {
            return new BloomFilter(count, true);
        }

        // Variant that resizes the filter to prime * hfNum positions.
        private BloomFilter(int count, boolean x) {
            size = (int)Math.ceil((((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
            hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
            prime = findPrime();
            size = prime * hfNum;
            filter = new BitSet(prime * hfNum);
            System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
        }

        public void add(String in) {
            filter.set(getMurmur(in), true);
            filter.set(getBlake(in), true);
            if(this.hfNum > 2) {
                for(int i = 3; i <= (hfNum); i++) {
                    filter.set(getHash(in, i));
                }
            }
        }

        public boolean check(String in) {
            if(!filter.get(getMurmur(in)) || !filter.get(getBlake(in))) {
                return false;
            }
            for(int i = 3; i <= hfNum; i++) {
                if(!filter.get(getHash(in, i))) {
                    return false;
                }
            }
            return true;
        }

        // First base hash, reduced to a non-negative index below size.
        private int getMurmur(String in) {
            int temp = murmur(in) % (size);
            if(temp < 0) {
                temp = temp * -1;
            }
            return temp;
        }

        // Second base hash, parsed from a hex string and reduced the same way.
        private int getBlake(String in) {
            int temp = new BigInteger(blake256(in), 16).intValue() % (size);
            if(temp < 0) {
                temp = temp * -1;
            }
            return temp;
        }

        // i-th derived hash: (murmur + i * blake) mod size.
        private int getHash(String in, int i) {
            int temp = ((getMurmur(in)) + (i * getBlake(in))) % size;
            return temp;
        }

        private int findPrime() {
            int temp;
            int test = size;
            while((test * hfNum) > size ) {
                temp = test - 1;
                while(!isPrime(temp)) {
                    temp--;
                }
                test = temp;
            }
            if((test * hfNum) < this.size) {
                test++;
                while(!isPrime(test)) {
                    test++;
                }
            }
            return test;
        }

        private static boolean isPrime(int num) {
            if (num < 2) return false;
            if (num == 2) return true;
            if (num % 2 == 0) return false;
            for (int i = 3; i * i <= num; i += 2)
                if (num % i == 0) return false;
            return true;
        }

        @Override
        public String toString() {
            final StringBuilder buffer = new StringBuilder(size);
            IntStream.range(0, size).mapToObj(i -> filter.get(i) ? '1' : '0').forEach(buffer::append);
            return buffer.toString();
        }
    }
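For reference, the sizing in the constructor follows the usual Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, with p = 1/fp. The following is a minimal standalone sketch of that arithmetic for n = 4000 and fp = 232000. The class name BloomSizing and its method names are made up for illustration, and it rounds k up only to mirror the constructor above.

```java
// Sketch: standard Bloom filter sizing for n items and a target
// false-positive probability p. Names here are illustrative only.
class BloomSizing {
    // m = ceil(-n * ln(p) / (ln 2)^2)
    static int optimalBits(int n, double p) {
        double ln2 = Math.log(2);
        return (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // k = ceil((m / n) * ln 2), using floating-point division
    static int optimalHashes(int m, int n) {
        return (int) Math.ceil(((double) m / n) * Math.log(2));
    }

    public static void main(String[] args) {
        int n = 4000;
        double p = 1.0 / 232000; // one false positive every 232000 lookups
        int m = optimalBits(n, p);
        int k = optimalHashes(m, n);
        System.out.println("bits = " + m + ", hash functions = " + k);
    }
}
```

For these inputs the bit count comes out a little under 103,000 with 18 hash functions, so the size and hash count printed by the constructor can be compared against it.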
Here is the code I'm using to test it:
    public static void main(String[] args) throws Exception {
        int z = 0;
        int times = 10;
        while(z < times) {
            z++;
            System.out.print("\r");
            System.out.print(z);
            BloomFilter test = new BloomFilter(4000);
            SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
            for(int i = 0; i < 4000; i++) {
                test.add(blake256(Integer.toString(random.nextInt())));
            }
            int temp = 0;
            int count = 1;
            while(!test.check(blake512(Integer.toString(temp)))) {
                temp = random.nextInt();
                count++;
            }
            if(z == (times)) {
                Files.write(Paths.get("counts.txt"), (Integer.toString(count)).getBytes(), StandardOpenOption.APPEND);
            }else {
                Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes(), StandardOpenOption.APPEND);
            }
            if(z == 1) {
                Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes());
            }
        }
    }
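Another way to estimate the rate is to count false positives over a fixed number of lookups, rather than timing the first one. The sketch below does that under stated assumptions: FpRateSketch is a hypothetical stand-in, not the class above, and it derives its two base hashes from a single SHA-256 digest instead of murmur3 and blake2b, purely so that it is self-contained. The keys "member-i" and "probe-i" are also made up and are disjoint, so every hit on a probe is a false positive.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Sketch: measure a Bloom filter's false-positive count over a fixed
// number of lookups. SHA-256 is an assumption made for self-containment.
class FpRateSketch {
    private final BitSet bits;
    private final int m; // number of bit positions
    private final int k; // number of probes per item

    FpRateSketch(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Two 64-bit base hashes taken from the first 16 bytes of a SHA-256 digest.
    private static long[] baseHashes(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(d);
            return new long[] { buf.getLong(), buf.getLong() };
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // i-th probe: (h1 + i * h2) mod m; floorMod keeps the index non-negative.
    private int index(long[] h, int i) {
        return (int) Math.floorMod(h[0] + i * h[1], (long) m);
    }

    void add(String s) {
        long[] h = baseHashes(s);
        for (int i = 0; i < k; i++) bits.set(index(h, i));
    }

    boolean check(String s) {
        long[] h = baseHashes(s);
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(h, i))) return false;
        }
        return true;
    }

    // Inserts n members, then counts how many of `trials` non-members match.
    static long countFalsePositives(int n, int m, int k, int trials) {
        FpRateSketch f = new FpRateSketch(m, k);
        for (int i = 0; i < n; i++) f.add("member-" + i);
        long fps = 0;
        for (int i = 0; i < trials; i++) {
            if (f.check("probe-" + i)) fps++;
        }
        return fps;
    }

    public static void main(String[] args) {
        int trials = 1_000_000;
        long fps = countFalsePositives(4000, 102858, 18, trials);
        System.out.println(fps + " false positives in " + trials + " lookups");
    }
}
```

Dividing the count by the number of lookups gives a direct estimate of the false-positive rate, which is less noisy than the position of a single first hit.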
I expect to get a value relatively close to the fp variable in the BloomFilter class, but instead I frequently get about half that. Does anyone know what I'm doing wrong, or is this normal?
EDIT: To show what I mean by high error rates: when I run the code on a filter initialized with count 4000 and fp 232000, this is the output, i.e. how many numbers the filter had to go through before it found a false positive:
158852,354114,48563,76875,156033,82506,61294,2529,82008,32624
This was generated using the extraSecure() method for initialization and repeated 10 times to produce these 10 numbers; all but one of the runs took fewer than 232000 generated values to find a false positive. The average of the 10 is about 105540, and that is typical no matter how many times I repeat this test.
Looking at the values, the fact that it found a false positive after generating only 2529 numbers is a huge issue for me, considering I'm adding 4000 data points.
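To put these counts in context, here is a minimal sketch that assumes each lookup is an independent trial with false-positive probability p = 1/232000, in which case the number of lookups until the first false positive is geometrically distributed. The class FirstFalsePositive and its method names are hypothetical. It computes the mean, the median, and the chance of a first hit within 2529 lookups.

```java
// Sketch: distribution of "lookups until first false positive",
// assuming independent lookups that each match with probability p.
class FirstFalsePositive {
    // Probability that the first false positive occurs within t lookups.
    static double probWithin(long t, double p) {
        return 1.0 - Math.pow(1.0 - p, t);
    }

    // Median: smallest t such that probWithin(t, p) >= 0.5.
    static long median(double p) {
        return (long) Math.ceil(Math.log(0.5) / Math.log(1.0 - p));
    }

    public static void main(String[] args) {
        double p = 1.0 / 232000;
        double early = probWithin(2529, p);
        System.out.println("mean lookups   = " + (1.0 / p));
        System.out.println("median lookups = " + median(p));
        System.out.println("P(first hit within 2529)  = " + early);
        // Chance that at least one of 10 independent runs ends that early.
        System.out.println("P(any of 10 runs <= 2529) = " + (1.0 - Math.pow(1.0 - early, 10)));
    }
}
```

Under this assumption the median is about 161,000 lookups, well below the mean of 232,000. A single run ending within 2529 lookups has a probability of roughly 1%, or roughly 10% for at least one of 10 runs. This does not by itself account for an average near 105540, but it gives a baseline for comparing the observed counts.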