
I'm reading about the MinHash technique for estimating the similarity between two sets. Given sets A and B, let h be the hash function and hmin(S) the minimum hash of set S, i.e. hmin(S)=min(h(s)) for s in S. We have the equation:

p(hmin(A)=hmin(B))=|A∩B| / |A∪B|

This means that the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B.

I am trying to prove the above equation and came up with my own proof: take a∈A and b∈B such that h(a)=hmin(A) and h(b)=hmin(B). Then, if hmin(A)=hmin(B), we have h(a)=h(b). Assume that the hash function h maps distinct keys to distinct hash values, so h(a)=h(b) if and only if a=b, which has a probability of |A∩B| / |A∪B|. However, my proof is not complete, since a hash function can return the same value for different keys. So I'm asking for your help to find a proof that applies regardless of the hash function.

Long Thai

4 Answers


I can't be sure what your exact question is.

But if you are looking for a method to prove:

probability that minimum hash of A equals to minimum hash of B is the Jaccard similarity of A and B.

Try having a look at section 3.3.3 of Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman.

Nilesh

Think of the hash function just as a means to provide a random permutation of (A ∪ B). Now, think about that permutation.

Put every element of (A ∪ B) as a row in a table, ordered by the permutation p you have chosen, with two columns A and B marking membership in each set, like this:

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
p = {5, 6, 1, 2, 4, 3}

The table:

   A  B
5  1  0
6  1  1
1  1  0
2  0  1
4  0  1
3  1  1

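There are only two types of rows. X: rows where both A and B are 1. Y: rows where exactly one of A and B is 1. (A row with 0 in both columns cannot occur, because every row is an element of A ∪ B.)

There are |A ∪ B| = X + Y rows in total, but only |A ∩ B| rows of type X. hmin(A) = hmin(B) exactly when the first row of the table is of type X, and under a random permutation every row is equally likely to come first. So the chance that the first row is of type X is X/(X+Y), i.e. Pr[hmin(A) = hmin(B)] = |A ∩ B| / |A ∪ B|.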

This is exactly what the book Nilesh linked says, but I tried to explain with another example.
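
The counting argument can be checked by brute force on the example sets. The following is a minimal sketch, not a proof: it assumes the hash behaves like a uniformly random permutation of A ∪ B with no collisions, and the helper name first_element is made up for illustration. It enumerates all 720 orderings of the six elements and counts how often A and B share the same first element.

```python
from fractions import Fraction
from itertools import permutations

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
universe = sorted(A | B)

def first_element(s, order):
    """Return the element of s that appears earliest in the ordering."""
    rank = {x: i for i, x in enumerate(order)}
    return min(s, key=rank.__getitem__)

matches = 0
total = 0
for order in permutations(universe):
    total += 1
    # Same earliest element means the same minimum hash (no collisions assumed).
    if first_element(A, order) == first_element(B, order):
        matches += 1

probability = Fraction(matches, total)
jaccard = Fraction(len(A & B), len(A | B))
print(probability, jaccard)  # prints: 1/3 1/3
```

For these sets the match happens in 240 of the 720 orderings, which is exactly the orderings whose first row is 3 or 6.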

Juan Lopes

This can't be proved "regardless of the hash function". Just consider: you could use a very poor hash function that produces extremely frequent collisions (such as simply binary-ANDing all values together). MinHash would no longer approximate Jaccard similarity at all, but would report much higher similarities. Proofs of MinHash that I've seen have assumed that hash collisions will be rare enough to be insignificant.
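
To see how collisions inflate the estimate, here is a small sketch under stated assumptions. It does not reproduce the binary-AND example; instead it uses a hypothetical collision-prone hash that takes an element's position in a random ordering and integer-divides it by bucket_width, so neighbouring positions share a hash value. The sets A and B are borrowed from the permutation example in this thread, and match_probability is an illustrative name.

```python
from fractions import Fraction
from itertools import permutations

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
universe = sorted(A | B)

def match_probability(bucket_width):
    """Exact Pr[hmin(A) == hmin(B)] over all orderings for a bucketed hash."""
    matches = 0
    total = 0
    for order in permutations(universe):
        # bucket_width == 1 gives distinct hashes; larger widths force collisions.
        h = {x: position // bucket_width for position, x in enumerate(order)}
        total += 1
        if min(h[x] for x in A) == min(h[x] for x in B):
            matches += 1
    return Fraction(matches, total)

jaccard = Fraction(len(A & B), len(A | B))
collision_free = match_probability(1)
coarse = match_probability(2)
print(jaccard, collision_free, coarse)
```

With bucket_width 1 the result equals the Jaccard similarity; with bucket_width 2 the match probability comes out larger, which is the overestimate described above.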

Ben Whitmore

Assume collisions will never happen, or will be negligible. You just choose a length for your hashes such that the chance of them colliding becomes arbitrarily small. This article describes the bounds for various numbers of items and hash sizes. https://en.wikipedia.org/wiki/Birthday_attack
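
As a rough illustration of choosing a hash length, the sketch below uses the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items hashed uniformly into b bits. The function name and the sample sizes are assumptions chosen for illustration, and the formula is an approximation that presumes an ideal, uniformly distributed hash.

```python
import math

def collision_probability(n_items, hash_bits):
    """Approximate chance that at least two of n_items share a hash value."""
    exponent = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)

for bits in (32, 64, 128):
    print(bits, collision_probability(10**6, bits))
```

For a million items, a 32-bit hash is almost certain to collide somewhere, while 64 bits already pushes the approximate probability below one in ten million.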

  • 2
    Thanks for trying to contribute to stack overflow. Though the link might solve/answer the issue/question, its better to add a consolidated details out here to make the answer more clearer. – Karthick Ramesh Nov 19 '18 at 21:17