I have concatenated two ISO files into one file. Both the individual ISO files are Linux distros of the same vendor but different versions. In the program I have written (shown below), the concatenated file in read in blocks of 512 bytes and MD5sum of the block is computed. The MD5sum is stored in a Hashet<String>
. If a block with the same signature is found using HashSet
lookup, this is recorded.
The exact same algorithm is also done using BloomFilter prior to actual look up on the HashSet
. As a BloomFilter
provides guarantees on "non-containment" only and can provide false-positives on containment, I also look up the HashSet if the BloomFilter
reports that a key might be present already.
The concatenated file size is > 1GB and hence the number of 512 byte block signatures exceeds 1.77 million. The performance of the approach using BloomFilter
is consistently ~six times more than that of the first approach.
Any reasons why this might be the case? Is there something wrong I have done here?
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang3.time.StopWatch;
public class SimpleDedupTrial {
public static void main(String[] args) throws IOException {
int blockSize = 512;
HashSet<String> signatureSet = new HashSet<>();
File f = new File(
"D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
);
FileInputStream fis = new FileInputStream(f);
long length = f.length();
long sizeSaved = 0l;
StopWatch sw = new StopWatch();
int len;
byte[] buffer = new byte[blockSize];
while ((len = fis.read(buffer)) != -1) {
String md5Hex = DigestUtils.md5Hex(buffer);
if (sw.isStopped()) {
sw.start();
}
if (sw.isSuspended()) {
sw.resume();
}
if (signatureSet.contains(md5Hex)) {
sizeSaved += len;
} else {
signatureSet.add(md5Hex);
}
sw.suspend();
}
sw.stop();
fis.close();
System.out.println("Time: "+sw.getTime(TimeUnit.MILLISECONDS));
System.out.println("File size in MB: "+convertToMB(length));
System.out.println("Size saved in MB: "+convertToMB(sizeSaved));
System.out.println("Signature set size: "+signatureSet.size());
System.out.println("Duplicate ratio: "+ ((double)sizeSaved * 100 / length));
System.out.println("With Blooom:");
useBloomFilter();
}
private static long convertToMB(long sizeInBytes) {
return sizeInBytes / (1024 * 1024);
}
private static void useBloomFilter() throws IOException {
int blockSize = 512;
Funnel<String> strFunnel = (String t, PrimitiveSink ps) -> {
ps.putString(t, Charsets.US_ASCII);
};
HashSet<String> signatureSet = new HashSet<>();
File f = new File(
"D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
);
FileInputStream fis = new FileInputStream(f);
long length = f.length();
long sizeSaved = 0l;
BloomFilter<String> signatureBloomFilter = BloomFilter.create(
strFunnel, (length / blockSize)
);
StopWatch sw = new StopWatch();
int len;
byte[] buffer = new byte[blockSize];
while ((len = fis.read(buffer)) != -1) {
String md5Hex = DigestUtils.md5Hex(buffer);
if (sw.isStopped()) {
sw.start();
}
if (sw.isSuspended()) {
sw.resume();
}
if (signatureBloomFilter.mightContain(md5Hex)) {
if (!signatureSet.contains(md5Hex)) {
signatureBloomFilter.put(md5Hex);
signatureSet.add(md5Hex);
} else {
sizeSaved += len;
}
} else {
signatureBloomFilter.put(md5Hex);
signatureSet.add(md5Hex);
}
sw.suspend();
}
sw.stop();
fis.close();
System.out.println("Time: "+sw.getTime(TimeUnit.MILLISECONDS));
System.out.println("File size in MB: "+convertToMB(length));
System.out.println("Size saved in MB: "+convertToMB(sizeSaved));
System.out.println("Signature set size: "+signatureSet.size());
System.out.println("Duplicate ratio: "+ ((double)sizeSaved * 100 / length));
}
}
Sample output:
Time: 819
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734
With Blooom:
Time: 4539
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734