1

Let's say I have to find estimate the jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures.

How should I set my k value? Since setting it to a really high value would increase computation time significantly, what could be the least value of k which can give me a good jaccard index estimate?

Given error tolerance e>0 and delta, how can I determine the minimum value of k such that the jaccard index is between (1-e)jaccard_estimate and (1+e)jaccard_estimate with probability greater than or equal to (1-delta)?

I believe this can be derived using chernoff inequality bound, but I'm unable to figure about how to go about it. Any help would be appreciated. Thanks in advance!

theitpushover
  • 23
  • 1
  • 4

1 Answers1

0

If J' denotes the estimate and J is the true Jaccard similarity, (k * J') follows a binomial distribution with parameters k and J. As a consequence, the variance of the estimate is Var(J') = J(1-J)/k <= 1/(4*k). Therefore the standard deviation of J is bounded by stdev(J') <= 1/(2*sqrt(k)).

otmar
  • 386
  • 1
  • 9