
All the Locality Sensitive Hashing explanations I have read (e.g. http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search ) describe that k hash functions are generated, but only l (l < k) are used in the hash tables to hash the values.

Why generate k at all, rather than just l?

Why the separate factors k and l?

I don't understand it.

gsamaras
vardump

1 Answer


All of the hash functions are in fact used. This makes more sense if you remember that, for example, in the section "Bit sampling for Hamming distance" an individual hash function might simply return a single bit. Another example of an LSH hash function is to take a randomly chosen plane in some d-dimensional space and return 0 or 1 according to which side of the plane the point being hashed lies on.
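As a rough illustration of the random-plane idea, here is a minimal Python sketch. The name `make_hyperplane_hash` is made up for this example, and it assumes the plane passes through the origin and that its normal vector is drawn from a Gaussian; neither detail comes from the article.

```python
import random

def make_hyperplane_hash(d, rng):
    """Return a one-bit hash defined by a random plane through the origin.

    Assumption: the plane is represented by a Gaussian random normal vector.
    """
    normal = [rng.gauss(0.0, 1.0) for _ in range(d)]

    def h(point):
        # The sign of the dot product says which side of the plane the point is on.
        dot = sum(n * x for n, x in zip(normal, point))
        return 1 if dot >= 0 else 0

    return h

rng = random.Random(0)
h = make_hyperplane_hash(3, rng)

bit_p = h((1.0, 2.0, 3.0))
bit_scaled = h((2.0, 4.0, 6.0))    # same direction, so same side of the plane
bit_neg = h((-1.0, -2.0, -3.0))    # opposite direction, so the other side
```

Each such function yields only one bit, which is why several of them have to be combined to form a useful table key.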

To address a single table, because each hash function may return just a single bit, you evaluate k hash functions and concatenate the results, giving you (in the single-bit case) a k-bit key. With l tables you need l different keys, so in fact you need a total of l*k hash functions.
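The construction can be sketched in Python as follows, using bit sampling on binary vectors. The names `build_index` and `query` and the toy data are hypothetical, and this is only a sketch of the idea under those assumptions, not a tuned implementation.

```python
import random
from collections import defaultdict

def build_index(points, d, k, l, seed=0):
    """Build l hash tables, each keyed by k sampled bit positions."""
    rng = random.Random(seed)
    # One list of k positions per table: l*k one-bit hash functions in total.
    positions = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
    tables = [defaultdict(list) for _ in range(l)]
    for idx, p in enumerate(points):
        for pos, table in zip(positions, tables):
            key = tuple(p[i] for i in pos)   # concatenate k single-bit hashes
            table[key].append(idx)
    return positions, tables

def query(q, positions, tables):
    """Collect the indices of all points sharing a bucket with q in any table."""
    candidates = set()
    for pos, table in zip(positions, tables):
        key = tuple(q[i] for i in pos)
        candidates.update(table.get(key, ()))
    return candidates

points = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 1, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 0, 1, 1, 1, 1),
]
positions, tables = build_index(points, d=8, k=3, l=4)
found = query(points[0], positions, tables)
```

The candidates returned by `query` still have to be checked against the real distance, since a bucket can also contain points that are not near neighbours.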

Check: look at the success probability. When looking up a single table, a single hash function returns the same value for the query and the near neighbour with probability P1. To find the near neighbour in a single table, all k hash functions must agree, so that probability is P1^k and the single lookup fails with probability 1 - P1^k. But you try this l times, so the probability that all lookups fail is (1 - P1^k)^l and the success probability is 1 - (1 - P1^k)^l, which is exactly what they calculate.
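To get a feel for the numbers, the formula can be evaluated directly. The function name `success_probability` and the values P1 = 0.9, k = 10, l = 10 are picked purely for illustration.

```python
def success_probability(p1, k, l):
    """Probability that a given near neighbour is found in at least one of l tables."""
    per_table = p1 ** k                  # all k hash functions agree in one table
    return 1.0 - (1.0 - per_table) ** l  # at least one of the l lookups succeeds

one = success_probability(0.9, k=10, l=1)
many = success_probability(0.9, k=10, l=10)
print(one, many)
```

With these illustrative values a single table succeeds with probability of about 0.35, while ten tables push it to about 0.99. That is the trade-off between the two factors: a larger k makes each key more selective, and a larger l wins back the recall that this costs.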

mcdowella
  • Ok, thanks for that awesome answer! To recapitulate: We have k * l different hash functions in our LSH family L. We create a new family G that consists of l new functions, each of which uses the concatenation of k of the functions from L. So the functions g look like the following: g_1 = [l_1, ..., l_k], g_2 = [l_{k+1}, ..., l_{2*k}], g_3 = [l_{2*k + 1}, ..., l_{3*k}] and so on? And now when I want the ANN Query, I look in all tables for a collision. If there is a collision, the other item I found is a near neighbour with the probability p1. What if no collision? – vardump Jun 10 '15 at 14:07
  • The probability I worked out was the probability that any particular near neighbor will be found, assuming that you check every value you retrieve from a hash table. They talk about buckets, so I assume that there might be more than one value found in a bucket, with some of them not near neighbors. I have made no calculation at all of the number of things that you might retrieve that are not near neighbors, though the article does have a P2 that seems to be connected with that. If there was no collision then, if there was a near neighbor, you were on the unlucky end of the probability I worked out. – mcdowella Jun 10 '15 at 17:50