
I have concatenated two ISO files into one file. Both ISO files are Linux distros from the same vendor, but different versions. In the program I have written (shown below), the concatenated file is read in blocks of 512 bytes and the MD5 sum of each block is computed. The MD5 sum is stored in a `HashSet<String>`. If a block with the same signature is found via a HashSet lookup, this is recorded.

The exact same algorithm is also run with a BloomFilter consulted before the actual lookup in the HashSet. Since a BloomFilter only gives guarantees on non-containment and can return false positives on containment, I also look up the HashSet whenever the BloomFilter reports that a key might already be present.

The concatenated file is larger than 1 GB, so the number of 512-byte block signatures exceeds 1.77 million. The approach using the BloomFilter consistently takes about six times longer than the first approach.

Any reasons why this might be the case? Is there something I have done wrong here?

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang3.time.StopWatch;

public class SimpleDedupTrial {

    public static void main(String[] args) throws IOException {

        int blockSize = 512;

        HashSet<String> signatureSet = new HashSet<>();

        File f = new File(
            "D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
        );

        FileInputStream fis = new FileInputStream(f);

        long length = f.length();
        long sizeSaved = 0L;

        StopWatch sw = new StopWatch();

        int len;
        byte[] buffer = new byte[blockSize];
        while ((len = fis.read(buffer)) != -1) {

            // Hash the block. Note: the last read may return fewer than 512
            // bytes, but the full buffer is hashed regardless.
            String md5Hex = DigestUtils.md5Hex(buffer);

            // The StopWatch runs only around the set operations, so the
            // reported time excludes file I/O and MD5 computation.
            if (sw.isStopped()) {
                sw.start();
            }
            if (sw.isSuspended()) {
                sw.resume();
            }

            if (signatureSet.contains(md5Hex)) {
                sizeSaved += len;
            } else {
                signatureSet.add(md5Hex);
            }

            sw.suspend();
        }

        sw.stop();

        fis.close();

        System.out.println("Time: "+sw.getTime(TimeUnit.MILLISECONDS));
        System.out.println("File size in MB: "+convertToMB(length));
        System.out.println("Size saved in MB: "+convertToMB(sizeSaved));
        System.out.println("Signature set size: "+signatureSet.size());
        System.out.println("Duplicate ratio: "+ ((double)sizeSaved * 100 / length));

        System.out.println("With Blooom:");
        useBloomFilter();
    }

    private static long convertToMB(long sizeInBytes) {
        return sizeInBytes / (1024 * 1024);
    }

    private static void useBloomFilter() throws IOException {
        int blockSize = 512;

        Funnel<String> strFunnel = (String t, PrimitiveSink ps) -> {
            ps.putString(t, Charsets.US_ASCII);
        };

        HashSet<String> signatureSet = new HashSet<>();

        File f = new File(
            "D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
        );

        FileInputStream fis = new FileInputStream(f);

        long length = f.length();
        long sizeSaved = 0L;

        // Expected insertions: one signature per 512-byte block
        // (uses Guava's default 3% false-positive probability).
        BloomFilter<String> signatureBloomFilter = BloomFilter.create(
            strFunnel, (length / blockSize)
        );

        StopWatch sw = new StopWatch();

        int len;
        byte[] buffer = new byte[blockSize];
        while ((len = fis.read(buffer)) != -1) {

            String md5Hex = DigestUtils.md5Hex(buffer);

            // As above, only the filter/set operations are timed.
            if (sw.isStopped()) {
                sw.start();
            }
            if (sw.isSuspended()) {
                sw.resume();
            }

            // Consult the Bloom filter first; a "might contain" answer still
            // has to be confirmed against the HashSet.
            if (signatureBloomFilter.mightContain(md5Hex)) {
                if (!signatureSet.contains(md5Hex)) {
                    signatureBloomFilter.put(md5Hex);
                    signatureSet.add(md5Hex);
                } else {
                    sizeSaved += len;
                }
            } else {
                signatureBloomFilter.put(md5Hex);
                signatureSet.add(md5Hex);
            }
            sw.suspend();
        }

        sw.stop();

        fis.close();

        System.out.println("Time: "+sw.getTime(TimeUnit.MILLISECONDS));
        System.out.println("File size in MB: "+convertToMB(length));
        System.out.println("Size saved in MB: "+convertToMB(sizeSaved));
        System.out.println("Signature set size: "+signatureSet.size());
        System.out.println("Duplicate ratio: "+ ((double)sizeSaved * 100 / length));
    }
}

Sample output:

Time: 819
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734
With Bloom:
Time: 4539
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734
  • For me you misunderstand the purpose of a `BloomFilter`: a `BloomFilter` is a lightweight data structure that you put in front of something known to be slow, such as disk access, or known to be too big to simply load and keep in memory as a cache. Here you already effectively have a cache (your HashSet); if you can afford it, you simply don't need a `BloomFilter`. – Nicolas Filotto Jan 26 '17 at 09:32
  • I see. So in my use case of finding file block signatures, BloomFilters would not really help. Is that right? Let's assume that millions of file block signatures are already stored in a separate file on disk (computed beforehand) and I have loaded only the corresponding BloomFilter in memory. Whenever the BloomFilter reports that a key might exist, I will have to load all the millions of signatures into memory anyway to ascertain it. Am I right here? – Keshav Jan 26 '17 at 10:28
  • If you need to load the millions of signatures into memory, there is something wrong in your architecture. You need to be able to check directly whether a given signature exists. Say all your signatures are stored in a DB with the signature as primary key, and they cannot all fit into memory. You would then use a BloomFilter into which you initially load all your signatures; instead of launching a query to check whether a given signature exists, you first check the BF, which reduces the total number of queries. – Nicolas Filotto Jan 26 '17 at 17:35
  • You would only query in case the BF indicates that the signature may exist; a BF properly configured to limit the probability of false positives can drastically reduce the total number of accesses to your slow back-end. – Nicolas Filotto Jan 26 '17 at 17:39 (a sketch of this pattern follows below)
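
A minimal, hypothetical sketch of the pattern described in the comments above (not from the question's code): a Guava BloomFilter sits in front of a slow signature store, and the store is only queried when the filter answers "might contain". The `SignatureStore` interface and its methods are assumptions made up for illustration.

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFrontedStore {

    /** Hypothetical slow back-end, e.g. a DB keyed by signature. */
    interface SignatureStore {
        boolean contains(String signature); // expensive: disk or network access
        void add(String signature);
    }

    private final SignatureStore store;
    private final BloomFilter<CharSequence> filter;

    BloomFrontedStore(SignatureStore store, long expectedSignatures) {
        this.store = store;
        // 1% false-positive rate: the occasional unnecessary store lookup,
        // in exchange for a small in-memory footprint.
        this.filter = BloomFilter.create(
            Funnels.stringFunnel(Charsets.US_ASCII), expectedSignatures, 0.01);
    }

    /** Returns true if the signature was already known. */
    boolean checkAndRecord(String signature) {
        if (!filter.mightContain(signature)) {
            // Definitely new: no expensive store lookup needed.
            filter.put(signature);
            store.add(signature);
            return false;
        }
        // Might be present: only now pay for the slow lookup.
        boolean known = store.contains(signature);
        if (!known) {
            filter.put(signature);
            store.add(signature);
        }
        return known;
    }
}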

1 Answer


It looks like you missed the point of a Bloom filter a bit. We use them when we cannot afford the memory and agree to lose some accuracy. For example, you might accept delivering two push notifications to (or sending none to) 1 in 100 users in order to avoid storing the whole collection of users who have already received the notification.

With the HashSet you already have expected O(1) access time, so the Bloom filter cannot speed the process up, and as you can see it actually slows it down. What it does offer is a very small memory footprint, but that saving isn't visible in your timing statistics.
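
As a side note, here is a minimal sketch (an illustration, not from the question): because the whole signature set fits in memory, one `HashSet.add()` per block is enough, since `add()` returns `false` when the element is already present; no Bloom filter and no separate `contains()` call are needed. The `DedupCounter` helper is hypothetical.

import java.util.HashSet;
import java.util.Set;

class DedupCounter {

    private final Set<String> signatures = new HashSet<>();
    private long sizeSaved;

    /** Record one block; add() returns false when the signature is already known. */
    void record(String md5Hex, int len) {
        if (!signatures.add(md5Hex)) {
            sizeSaved += len; // duplicate block: count the bytes saved
        }
    }

    long sizeSaved() {
        return sizeSaved;
    }
}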

This is because checking the Bloom filter takes roughly as long as the HashSet lookup when the answer is "definitely not present", and even longer overall when it is "might be present", since then the HashSet still has to be consulted to confirm.

You can read more here.

xenteros
  • Right! I guess the BloomFilter is not useful in my use case. I not only need non-containment, but also need to ascertain containment right away, which cannot be done without loading the entire collection of signatures into memory. – Keshav Jan 26 '17 at 10:30