What I need
I'm looking for a reasonably fast and accurate way to compute Jaccard similarity between many huge sets of data. I may have to perform 10,000–20,000 such similarity calculations. Because all Jaccard similarities must be computed right after the datasets are dumped, I can't run slow algorithms in the background during quiet winter nights.
I see two possible solutions:
Solution #1
Use the MinHash algorithm. The problem with this solution is that it's very slow: to get a 10% error you need about 100 hash functions. The only workaround I see is to hash everything once with a single "expensive" hash function and then apply 100 "cheap" hash functions to that hash result. But I don't have enough of a math background to choose them myself.
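To make the idea concrete, here is a minimal Java sketch of that scheme (not a specific library's API; the class name, the xorshift mixer standing in for the "expensive" hash, and the affine family `a*x + b` over 64-bit overflow arithmetic as the "cheap" hashes are all assumptions). The "100 functions for ~10% error" figure follows the usual rule of thumb that MinHash standard error is about `1/sqrt(k)`.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

// MinHash sketch: one "expensive" base hash per element, then k "cheap"
// derived hashes h_i(x) = a_i * x + b_i (64-bit overflow arithmetic).
// k = 100 targets roughly 10% standard error (error ~ 1/sqrt(k)).
public class MinHashSketch {
    private final long[] a, b;
    private final int k;

    public MinHashSketch(int k, long seed) {
        this.k = k;
        this.a = new long[k];
        this.b = new long[k];
        Random rnd = new Random(seed);
        for (int i = 0; i < k; i++) {
            a[i] = rnd.nextLong() | 1L;   // odd multiplier
            b[i] = rnd.nextLong();
        }
    }

    // "Expensive" base hash: here just a cheap 64-bit mix of hashCode();
    // in practice replace with MurmurHash, xxHash or similar.
    private static long baseHash(Object o) {
        long x = o.hashCode();
        x ^= (x << 21); x ^= (x >>> 35); x ^= (x << 4);
        return x;
    }

    public long[] signature(Set<?> set) {
        long[] sig = new long[k];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (Object o : set) {
            long x = baseHash(o);             // expensive hash, once per element
            for (int i = 0; i < k; i++) {
                long h = a[i] * x + b[i];     // cheap derived hash
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Estimated Jaccard similarity = fraction of matching signature slots.
    public static double jaccard(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) match++;
        return (double) match / s1.length;
    }
}
```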
Question #1
How do I choose a speed-efficient set of hash functions for MinHash so that the maximum error stays around 10%?
Solution #2
Use HyperLogLog or BitSet to calculate Jaccard similarity.
The problem with this approach is that in some cases I get errors that are too large, and I don't understand why. Another problem with BitSet (even though it's a sparse data structure) is that it takes too much RAM on larger datasets.
My algorithm (a code sketch follows the list):
- Choose a probabilistic cardinality estimation structure (HyperLogLog or BitSet).
- Estimate the cardinality of `set1`.
- Estimate the cardinality of `set2`.
- Estimate the cardinality of `set1 union set2` (both HyperLogLog and BitSet support a merge operation).
- Similarity between `set2` and `set1` = `(cardinality(set1) + cardinality(set2) - cardinality(set1 union set2)) / cardinality(set2)`
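For reference, a minimal Java sketch of these steps with `java.util.BitSet` as the cardinality structure (a HyperLogLog would plug in the same way via its merge operation). The bit-space size and the `hashCode()`-based bit mapping are assumptions for the sketch; with a fixed bit space the counts are only approximate, because hash collisions make the BitSet undercount, much like a linear-counting sketch.

```java
import java.util.BitSet;
import java.util.Set;

// Cardinality-based similarity estimate: build one BitSet per set,
// merge with bitwise OR, then apply inclusion-exclusion.
public class BitSetJaccard {
    static final int BIT_SPACE = 1 << 24;   // 16M bits, about 2 MB per (dense) BitSet

    static BitSet toBitSet(Set<?> set) {
        BitSet bits = new BitSet(BIT_SPACE);
        for (Object o : set) {
            // Map each element to a bit; collisions make the count approximate.
            bits.set(Math.floorMod(o.hashCode(), BIT_SPACE));
        }
        return bits;
    }

    // similarity(set1, set2) as defined above:
    // (|set1| + |set2| - |set1 union set2|) / |set2|
    static double similarity(Set<?> set1, Set<?> set2) {
        BitSet b1 = toBitSet(set1);
        BitSet b2 = toBitSet(set2);

        BitSet union = (BitSet) b1.clone();
        union.or(b2);                        // merge = bitwise OR

        double c1 = b1.cardinality();
        double c2 = b2.cardinality();
        double cu = union.cardinality();
        return (c1 + c2 - cu) / c2;
    }
}
```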
Question #2
Why do I get the same deviation of the Jaccard similarity estimate with both BitSet and HyperLogLog? BitSet gives better cardinality precision than HLL, and I thought that since BitSet takes much more space it should also be much more accurate. Am I wrong?
Question #3
Is it impossible to achieve a Jaccard similarity deviation of less than 5% with BitSet and HyperLogLog? What am I doing wrong?