
I have a system consisting of a few application instances, written in Java. Requests to them are load balanced for high availability. Every second, hundreds of small chunks of data (each consisting of a few simple strings) are received by this "cluster", stored in the database, kept for a couple of days and then discarded. Apart from storing this data, the main requirement is to quickly determine whether a given value is stored in the database or not. An appropriately indexed and partitioned database table seems suited to the problem and does its job well, at least for now.

The problem is that about 80% of the searched values are not found, because they are not in the database. I would therefore like to speed things up and make the search faster and less resource-intensive. A Bloom filter would be the obvious choice, were it not for the fact that each application instance receives a different part of the data; if each instance has only part of the data in its Bloom filter, then those Bloom filters are useless on their own.

Do you have any suggestions/ideas on how to solve this problem?

zgguy

1 Answer


kept for a couple of days and then discarded

A plain Bloom filter does not support deleting elements, only inserting them. And if you keep multiple Bloom filters, you have to query all of them to check whether any one of them contains the value you need.
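
For example, assuming a filter type with an isPresent method (like the one linked below), a fan-out check over per-instance filters would look something like this; perInstanceFilters is a hypothetical java.util.List holding one filter per instance:

// A miss is certain only if ALL per-instance filters report a miss.
boolean mightBeStored(List<BloomFilter> perInstanceFilters, String value) {
    for (BloomFilter filter : perInstanceFilters) {
        if (filter.isPresent(value)) {
            return true;  // possible hit: fall through to the database check
        }
    }
    return false;         // definite miss: skip the database entirely
}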

Bloom filters can be merged efficiently, provided they have the same structure (the same size, the same hash functions, and so on): merging is just a bitwise OR of their bit sets.

You can use this Bloom filter implementation: https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilter.java

And merge two filters like this:

// Merges srcFilter into dstFilter by OR-ing their underlying bit sets.
// Both filters must be structurally identical (same size, same hash
// functions); otherwise the merged result is meaningless.
BloomFilter merge(BloomFilter dstFilter, BloomFilter srcFilter) {
    OpenBitSet dst = dstFilter.bitset;
    OpenBitSet src = srcFilter.bitset;

    if (dst.getPageCount() != src.getPageCount()) {
        throw new IllegalArgumentException("Filters are not structurally identical");
    }

    for (int i = 0; i < src.getPageCount(); ++i) {
        long[] dstBits = dst.getPage(i);
        long[] srcBits = src.getPage(i);

        // OR each word of the source page into the destination page.
        for (int j = 0; j < srcBits.length; ++j) {
            dstBits[j] |= srcBits[j];
        }
        dst.setPage(i, dstBits);
    }
    return dstFilter;
}
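
With that, each instance can periodically serialize its local filter and ship it to the others (the transport is up to you: HTTP, a message topic, shared storage), and each instance folds what it receives into one combined filter. A minimal sketch, where myLocalFilter and filtersFromOtherNodes are hypothetical names:

// Fold the filters received from the other instances into this node's
// own filter; 'combined' then answers membership queries for data
// ingested by all instances, not just this one.
BloomFilter combined = myLocalFilter;
for (BloomFilter remote : filtersFromOtherNodes) {
    combined = merge(combined, remote);
}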
Viacheslav Shalamov
  • Yes, but since old data is discarded only once per day, I thought it made sense to rebuild the Bloom filter every time old data is discarded. – zgguy May 19 '18 at 07:03
  • Did it answer your question? – Viacheslav Shalamov May 19 '18 at 21:28
  • Note: if you use a counting Bloom filter, you can delete individual keys: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/bloom/CountingBloomFilter.html – Erwin Bolwidt May 20 '18 at 01:41
  • Thanks for the comments. Perhaps my question is a bit misdirected; it is not about Bloom filters in themselves but rather about how to keep filters in sync across multiple instances effectively. But anyhow your comments are useful, thanks. – zgguy May 20 '18 at 09:39
  • @zgguy were you able to solve this? How do you keep the Bloom filter synchronised? Or what is the best way to use it in a distributed environment? – Ankit Raonka Oct 06 '21 at 09:50
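
Following up on the counting Bloom filter mentioned in the comments: Hadoop's CountingBloomFilter keeps a small counter per position instead of a single bit, which is what makes individual deletes possible. A minimal sketch; the vector size and hash count are placeholder values you would tune for your data volume and target false-positive rate:

import org.apache.hadoop.util.bloom.CountingBloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

// Placeholder sizing: tune the vector size (1_000_000) and the number
// of hash functions (5) for your data volume and false-positive rate.
CountingBloomFilter filter = new CountingBloomFilter(1_000_000, 5, Hash.MURMUR_HASH);

Key key = new Key("some-value".getBytes());
filter.add(key);                             // insert
boolean maybe = filter.membershipTest(key);  // true here: a possible hit
filter.delete(key);                          // supported, unlike a plain Bloom filter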