Set distance as similarity metric for MinHashing algorithm

Question

I am currently working on document clustering using MinHashing technique. However, I am not getting desired results as MinHash is a rough estimation of Jaccard similarity and it doesn't suits my requirement.

This is my scenario:

I have a huge set of books and if a single page is given as a query, I need to find the corresponding book from which this page is obtained from. The limitation is, I have features for the entire book and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity is giving poor results if the book is too big. What I really want is the distance between query page and the books (not vice-versa). That is:

Given 2 sets A, B: I want the distance from A to B,

dis(A->B) =  (A & B)/A

Is there similar distance metric that gives distance from set A to set B. Further, is it still possible to use MinHashing algorithm with this kind of similarity metric?

Can you provide details on your implementation? Which hash functions did you use? How many of them? — Juan Lopes, Aug 16 '15 at 03:27
I am using this MinHash implentation with 512 permutations. https://github.com/ekzhu/datasketch — Maggie, Aug 16 '15 at 03:30
[Also posted on CS.SE](http://cs.stackexchange.com/q/45320/755). Please [do not post the same question on multiple sites](http://meta.stackexchange.com/q/64068). Each community should have an honest shot at answering without anybody's time being wasted. — D.W., Apr 16 '16 at 20:37

score 1 · Answer 1 · answered Aug 17 '15 at 08:14

We can estimate your proposed distance function using a similar approach as the MinHash algorithm.

For some hash function h(x), compute the minimal values of h over A and B. Denote these values h_min(A) and h_min(B). The MinHash algorithm relies on the fact that the probability that h_min(A) = h_min(B) is (A & B) / (A | B). We may observe that the probability that h_min(A) <= h_min(B) is A / (A | B). We can then compute (A & B) / A as the ratio of these two probabilities.

Like in the regular MinHash algorithm, we can approximate these probabilities by repeated sampling until the desired variance is achieved.

Set distance as similarity metric for MinHashing algorithm

1 Answers1