What I need
I'm looking for a reasonably fast and accurate way to compute Jaccard similarity between many huge sets of data. I may have to perform 10,000–20,000 such similarity calculations. Because all Jaccard similarities must be computed right after the datasets are dumped, I can't run slow algorithms in the background during quiet winter nights.
I see two possible solutions:
Solution #1
Use the MinHash algorithm. The problem with this solution is that it's very slow: to get a 10% error you need about 100 hash functions. The only workaround I see is to hash everything once with a single "expensive" hash function and then apply 100 "cheap" hash functions to that hash result. But I don't have enough of a math background to choose them myself.
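To make the idea concrete, here is a minimal Java sketch of that scheme (not a specific library's API; the class name, the xorshift mixer standing in for the "expensive" hash, and the affine family `a*x + b` over 64-bit overflow arithmetic as the "cheap" hashes are all assumptions). The "100 functions for ~10% error" figure follows the usual rule of thumb that MinHash standard error is about `1/sqrt(k)`.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

// MinHash sketch: one "expensive" base hash per element, then k "cheap"
// derived hashes h_i(x) = a_i * x + b_i (64-bit overflow arithmetic).
// k = 100 targets roughly 10% standard error (error ~ 1/sqrt(k)).
public class MinHashSketch {
    private final long[] a, b;
    private final int k;

    public MinHashSketch(int k, long seed) {
        this.k = k;
        this.a = new long[k];
        this.b = new long[k];
        Random rnd = new Random(seed);
        for (int i = 0; i < k; i++) {
            a[i] = rnd.nextLong() | 1L;   // odd multiplier
            b[i] = rnd.nextLong();
        }
    }

    // "Expensive" base hash: here just a cheap 64-bit mix of hashCode();
    // in practice replace with MurmurHash, xxHash or similar.
    private static long baseHash(Object o) {
        long x = o.hashCode();
        x ^= (x << 21); x ^= (x >>> 35); x ^= (x << 4);
        return x;
    }

    public long[] signature(Set<?> set) {
        long[] sig = new long[k];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (Object o : set) {
            long x = baseHash(o);             // expensive hash, once per element
            for (int i = 0; i < k; i++) {
                long h = a[i] * x + b[i];     // cheap derived hash
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Estimated Jaccard similarity = fraction of matching signature slots.
    public static double jaccard(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) match++;
        return (double) match / s1.length;
    }
}
```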
Question #1
How do I choose a speed-efficient set of hash functions for MinHash so that the maximum error stays around 10%?
Solution #2
Use HyperLogLog or BitSet to calculate Jaccard similarity.
The problem with this approach is that in some cases I get errors that are too large, and I don't understand why. Another problem with BitSet (even though it's a sparse data structure) is that it takes too much RAM on larger datasets.
My algorithm (a code sketch follows the list):
- Choose a probabilistic cardinality estimation structure (HyperLogLog or BitSet).
- Estimate the cardinality of `set1`.
- Estimate the cardinality of `set2`.
- Estimate the cardinality of `set1 union set2` (both HyperLogLog and BitSet support a merge operation).
- Similarity between `set2` and `set1` = `(cardinality(set1) + cardinality(set2) - cardinality(set1 union set2)) / cardinality(set2)`
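For reference, a minimal Java sketch of these steps with `java.util.BitSet` as the cardinality structure (a HyperLogLog would plug in the same way via its merge operation). The bit-space size and the `hashCode()`-based bit mapping are assumptions for the sketch; with a fixed bit space the counts are only approximate, because hash collisions make the BitSet undercount, much like a linear-counting sketch.

```java
import java.util.BitSet;
import java.util.Set;

// Cardinality-based similarity estimate: build one BitSet per set,
// merge with bitwise OR, then apply inclusion-exclusion.
public class BitSetJaccard {
    static final int BIT_SPACE = 1 << 24;   // 16M bits, about 2 MB per (dense) BitSet

    static BitSet toBitSet(Set<?> set) {
        BitSet bits = new BitSet(BIT_SPACE);
        for (Object o : set) {
            // Map each element to a bit; collisions make the count approximate.
            bits.set(Math.floorMod(o.hashCode(), BIT_SPACE));
        }
        return bits;
    }

    // similarity(set1, set2) as defined above:
    // (|set1| + |set2| - |set1 union set2|) / |set2|
    static double similarity(Set<?> set1, Set<?> set2) {
        BitSet b1 = toBitSet(set1);
        BitSet b2 = toBitSet(set2);

        BitSet union = (BitSet) b1.clone();
        union.or(b2);                        // merge = bitwise OR

        double c1 = b1.cardinality();
        double c2 = b2.cardinality();
        double cu = union.cardinality();
        return (c1 + c2 - cu) / c2;
    }
}
```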
Question #2
Why do I get the same deviation of the Jaccard similarity estimate with both BitSet and HyperLogLog? BitSet gives better cardinality precision than HLL, and I thought that since BitSet takes much more space it should also be much more accurate. Am I wrong?
Question #3
Is it impossible to achieve a Jaccard similarity deviation of less than 5% with BitSet and HyperLogLog? What am I doing wrong?