0

I am currently working on document clustering using MinHashing technique. However, I am not getting desired results as MinHash is a rough estimation of Jaccard similarity and it doesn't suits my requirement.

This is my scenario:

I have a huge set of books and if a single page is given as a query, I need to find the corresponding book from which this page is obtained from. The limitation is, I have features for the entire book and it's impossible to get page-by-page features for the books. In this case, Jaccard similarity is giving poor results if the book is too big. What I really want is the distance between query page and the books (not vice-versa). That is:

Given 2 sets A, B: I want the distance from A to B,

dis(A->B) =  (A & B)/A

Is there similar distance metric that gives distance from set A to set B. Further, is it still possible to use MinHashing algorithm with this kind of similarity metric?

Maggie
  • 5,923
  • 8
  • 41
  • 56
  • Can you provide details on your implementation? Which hash functions did you use? How many of them? – Juan Lopes Aug 16 '15 at 03:27
  • I am using this MinHash implentation with 512 permutations. https://github.com/ekzhu/datasketch – Maggie Aug 16 '15 at 03:30
  • [Also posted on CS.SE](http://cs.stackexchange.com/q/45320/755). Please [do not post the same question on multiple sites](http://meta.stackexchange.com/q/64068). Each community should have an honest shot at answering without anybody's time being wasted. – D.W. Apr 16 '16 at 20:37

1 Answers1

1

We can estimate your proposed distance function using a similar approach as the MinHash algorithm.

For some hash function h(x), compute the minimal values of h over A and B. Denote these values h_min(A) and h_min(B). The MinHash algorithm relies on the fact that the probability that h_min(A) = h_min(B) is (A & B) / (A | B). We may observe that the probability that h_min(A) <= h_min(B) is A / (A | B). We can then compute (A & B) / A as the ratio of these two probabilities.

Like in the regular MinHash algorithm, we can approximate these probabilities by repeated sampling until the desired variance is achieved.

augurar
  • 12,081
  • 6
  • 50
  • 65