I've been reading up on the literature around locality-sensitive hashing (LSH), and I think I have a pretty good understanding of how it works. Considering the simplest case of a single hash table where each document is in only one bucket, my question is:

How do I find k nearest neighbours where k is greater than the number of documents in that bucket?

I've seen several methods to accomplish this. Some use a prefix tree. Others sort all the buckets by their Hamming distance from the query's bucket.
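To make the second approach concrete, here is a minimal sketch of how I understand it: instead of computing the distance to every bucket, enumerate candidate bucket hashes in order of increasing Hamming distance from the query hash and look each one up directly. The names `probe_sequence`, `knn_candidates`, and `lookup` are hypothetical, and the in-memory dict in the usage note only stands in for an indexed table.

```python
from itertools import combinations


def probe_sequence(query_hash, n_bits, max_radius):
    """Yield bucket hashes in order of increasing Hamming distance
    from query_hash, up to max_radius flipped bits."""
    for radius in range(max_radius + 1):
        # Every way of choosing `radius` bit positions to flip.
        for positions in combinations(range(n_bits), radius):
            probe = query_hash
            for p in positions:
                probe ^= 1 << p
            yield probe


def knn_candidates(query_hash, k, lookup, n_bits=32, max_radius=3):
    """Collect up to k document IDs from the buckets nearest to query_hash.

    `lookup` is an assumed callable mapping a bucket hash to a list of
    document IDs (for example, an indexed equality query on the bucket column).
    """
    found = []
    for probe in probe_sequence(query_hash, n_bits, max_radius):
        found.extend(lookup(probe))
        if len(found) >= k:
            break
    return found[:k]
```

A toy usage, with a dict in place of the database:

```python
buckets = {0b0000: ["a"], 0b0001: ["b"], 0b0011: ["c"], 0b1111: ["d"]}
print(knn_candidates(0b0000, 3, lambda h: buckets.get(h, []), n_bits=4, max_radius=2))
# ['a', 'b', 'c']
```

The number of probes at radius r is C(n_bits, r), so for a 32-bit hash, radius 3 already means 1 + 32 + 496 + 4960 = 5489 lookups. That growth is why I am unsure this scales to my case.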

My constraints:

I have my document IDs stored in PostgreSQL alongside their respective buckets. A full table scan to calculate the Hamming distance to each bucket is not feasible (I have hundreds of millions of documents). My bucket hash will likely be 24 or 32 bits (unless there is a suggestion against this).

Does anyone have experience with this, or suggestions on how to proceed?

JVillella

0 Answers