Proof of calculating Minhash

Question

I'm reading about the MinHash technique for estimating the similarity between two sets. Given sets A and B, let h be a hash function and hmin(S) be the minimum hash over set S, i.e. hmin(S)=min(h(s)) for s in S. We have the equation:

p(hmin(A)=hmin(B))=|A∩B| / |A∪B|

This means that the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B.

I am trying to prove the above equation and came up with my own proof: let a∈A and b∈B be such that h(a)=hmin(A) and h(b)=hmin(B). If hmin(A)=hmin(B), then h(a)=h(b). Assume that the hash function h maps distinct keys to distinct hash values, so h(a)=h(b) if and only if a=b, which has a probability of |A∩B| / |A∪B|. However, my proof is not complete, since a hash function can return the same value for different keys. So I'm asking for your help to find a proof that applies regardless of the hash function.
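
As a quick empirical sanity check of the identity being asked about (not a proof), here is a minimal Python sketch. It assumes a family of hash functions obtained by salting BLAKE2b with a seed; the helper names `h`, `hmin`, and `estimate`, the example sets, and the trial count are all illustrative choices, not part of the question.

```python
import hashlib

def h(x, seed):
    """64-bit hash of integer x; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")  # BLAKE2b accepts a salt of up to 16 bytes
    digest = hashlib.blake2b(str(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def hmin(s, seed):
    """Minimum hash value over the set s under the hash function `seed`."""
    return min(h(x, seed) for x in s)

def estimate(a, b, trials=2000):
    """Fraction of hash functions for which hmin(a) == hmin(b)."""
    matches = sum(hmin(a, seed) == hmin(b, seed) for seed in range(trials))
    return matches / trials

# Illustrative sets: 30 shared elements out of 90 in the union.
A = set(range(0, 60))
B = set(range(30, 90))

jaccard = len(A & B) / len(A | B)
est = estimate(A, B)
print(f"Jaccard = {jaccard:.3f}, MinHash match rate = {est:.3f}")
```

With 64-bit outputs, collisions among 90 elements are vanishingly unlikely, so the match rate should land close to the Jaccard similarity, up to sampling noise.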

score 0 · Answer 1 · answered May 10 '13 at 21:00

I can't be sure what your exact question is.

But if you are looking for a way to prove that:

the probability that the minimum hash of A equals the minimum hash of B is the Jaccard similarity of A and B,

try having a look at Section 3.3.3 of Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman.

score 0 · Answer 2 · answered Dec 08 '15 at 00:30

Think of the hash function just as a means of providing a random permutation of (A ∪ B). Now, think about that permutation.

List every element of (A ∪ B) as a row in a table, in the order given by the permutation p you have chosen, with two columns A and B, like this:

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
p = {5, 6, 1, 2, 4, 3}

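The table (rows in the order given by p; 1 means the element is in the set):

element | A | B
5       | 1 | 0
6       | 1 | 1
1       | 1 | 0
2       | 0 | 1
4       | 0 | 1
3       | 1 | 1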

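There are only two types of rows: type X, where A and B are both 1, and type Y, where A != B.

There are |A ∪ B| rows in total, but only |A ∩ B| rows of type X. hmin(A) = hmin(B) exactly when the first row is of type X, and under a random permutation every row is equally likely to come first, so the chance of that is X/(X+Y). In other words, Pr[hmin(A) = hmin(B)] = |A ∩ B| / |A ∪ B|.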
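
The counting argument can be checked exactly for the small example by enumerating every permutation of A ∪ B. This is a sketch for illustration only; the helper `first_in` and the variable names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import permutations

A = {1, 3, 5, 6}
B = {2, 3, 4, 6}
union = sorted(A | B)

def first_in(order, s):
    """First element of `order` that belongs to s (the min-hash element of s)."""
    return next(x for x in order if x in s)

# Rows of the table for the example permutation p: (element, in A, in B).
p = (5, 6, 1, 2, 4, 3)
rows = [(x, int(x in A), int(x in B)) for x in p]

# Count the permutations in which A and B share the same first element.
total = 0
matches = 0
for order in permutations(union):
    total += 1
    matches += first_in(order, A) == first_in(order, B)

prob = Fraction(matches, total)
print(rows)
print(prob, Fraction(len(A & B), len(A | B)))
```

For the permutation p itself, the first row is element 5, which is in A only, so that particular permutation gives hmin(A) != hmin(B). Averaged over all 720 permutations, the match fraction equals |A ∩ B| / |A ∪ B|.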

This is exactly what the book Nilesh linked says, but I tried to explain it with another example.

score 0 · Answer 3 · answered Feb 19 '18 at 04:20

This can't be proved "regardless of the hash function". Just consider: you could use a very poor hash function that produces extremely frequent collisions (such as simply binary-ANDing all values together). MinHash would no longer approximate Jaccard similarity at all, but would report much higher similarities. Proofs of MinHash that I've seen have assumed that hash collisions will be rare enough to be insignificant.
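
To make the failure mode concrete, here is a small Python sketch using a deliberately poor, hypothetical hash family with only 8 possible outputs (it is not the binary-AND hash mentioned above, just another collision-heavy example). The names `weak_hash` and `minhash_estimate` and the parameter grid are assumptions for illustration.

```python
def weak_hash(x, a, b):
    """Deliberately poor hash: only 8 distinct output values, so collisions abound."""
    return (a * x + b) % 8

def minhash_estimate(s1, s2, params):
    """Fraction of hash functions (a, b) whose minimum over s1 and s2 coincide."""
    hits = sum(
        min(weak_hash(x, a, b) for x in s1) == min(weak_hash(x, a, b) for x in s2)
        for a, b in params
    )
    return hits / len(params)

# Two disjoint sets: the true Jaccard similarity is 0.
A = set(range(0, 100))
B = set(range(100, 200))

params = [(a, b) for a in range(1, 20) for b in range(8)]
jaccard = len(A & B) / len(A | B)
estimate = minhash_estimate(A, B, params)
print(f"true Jaccard = {jaccard}, MinHash estimate = {estimate}")
```

Each set contains every residue modulo 8, so both sets reach the same set of hash values and always share the same minimum. The estimate is therefore 1.0 even though the sets have nothing in common.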

score 0 · Answer 4 · answered Nov 19 '18 at 21:14

Assume collisions will never happen, or will be negligible. You just choose a length for your hashes such that the chance of them colliding becomes arbitrarily small. This article describes the bounds for various numbers of items and hash sizes. https://en.wikipedia.org/wiki/Birthday_attack
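
As a rough illustration of how hash length drives the collision chance, the standard birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)) can be evaluated directly. This sketch assumes an ideal, uniformly distributed hash; the function name `collision_probability` is introduced here for illustration.

```python
import math

def collision_probability(n_items, hash_bits):
    """Approximate chance that at least two of n_items share a hash value,
    assuming a uniform hash with 2**hash_bits possible outputs."""
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

for bits in (32, 64, 128):
    for n in (10**3, 10**6, 10**9):
        print(f"{bits:>3}-bit hash, {n:>10} items: {collision_probability(n, bits):.3e}")
```

Under this approximation, a million items already collide almost surely with 32-bit hashes, while with 64-bit hashes the chance is on the order of 1e-8.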

Ryan Moulton

Thanks for trying to contribute to stack overflow. Though the link might solve/answer the issue/question, its better to add a consolidated details out here to make the answer more clearer. – Karthick Ramesh Nov 19 '18 at 21:17
