4

I have read a few solutions for nearest neighbor search in high dimensions using random hyperplanes, but I am still confused about how the buckets work. I have 100 million documents in the form of 100-dimensional vectors, and 1 million queries. For each query, I need to find the nearest neighbor based on cosine similarity. The brute-force approach is to compute the cosine similarity between the query and all 100 million documents and select the ones with a value close to 1. I am struggling with the concept of random hyperplanes, where I can put the documents in buckets so that I don't have to compute the cosine similarity 100 million times for each query.
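For reference, the brute-force baseline described above might look like the following minimal sketch. The function names `cosine` and `brute_force_nearest` are made up for illustration, and plain Python lists stand in for the real 100-dimensional vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def brute_force_nearest(query, docs):
    """Compare the query against every document; return (index, similarity) of the best."""
    best_idx, best_sim = -1, -2.0  # cosine is never below -1, so -2.0 is a safe start
    for i, doc in enumerate(docs):
        sim = cosine(query, doc)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx, best_sim
```

With N documents and Q queries this performs N × Q similarity computations, which is the cost the buckets are meant to avoid.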

gsamaras
viz12

1 Answer

2

Think of it geometrically: imagine your data as points in a high-dimensional space.

Create random hyperplanes (a hyperplane is just a plane generalized to higher dimensions; to picture it, mentally reduce the problem to 2D or 3D).

These hyperplanes cut through your data (the points) and create partitions, separating some points from others. Every point falls into exactly one partition, which serves as a rough approximation of its location.
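One common way to turn this picture into code, sketched here under assumptions, is to represent each hyperplane by a random normal vector through the origin and record on which side of it a point falls. The names `make_hyperplanes` and `signature`, the Gaussian sampling, and the seed are illustrative choices, not something the answer prescribes.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw num_planes random normal vectors; each defines a hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    bits = 0
    for normal in planes:
        dot = sum(n * v for n, v in zip(normal, vec))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits
```

The resulting integer identifies the partition a point falls into. Only the sign of each dot product matters, so scaling a vector by a positive constant leaves its signature unchanged, which fits cosine similarity.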

Now populate the buckets according to the partitions formed by the hyperplanes, one bucket per partition. As a result, every bucket contains far fewer points than the whole point set, because each partition holds only a fraction of it.
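A rough sketch of filling the buckets, assuming each partition is identified by the pattern of sides a point takes relative to every hyperplane. The helper names `signature` and `build_buckets` are hypothetical, and a dictionary is only one possible container.

```python
from collections import defaultdict

def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    bits = 0
    for normal in planes:
        dot = sum(n * v for n, v in zip(normal, vec))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits

def build_buckets(docs, planes):
    """Map each partition's signature to the list of document ids that fall in it."""
    buckets = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        buckets[signature(vec, planes)].append(doc_id)
    return buckets
```

With k hyperplanes there are at most 2^k buckets, so k controls the trade-off: more hyperplanes give smaller buckets but make it more likely that close points get separated.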

Consequently, when you pose a query, you check far fewer points (only those in the query's bucket) than the total. That is the entire gain: checking fewer points makes you much faster than the brute-force approach, which checks all the points.
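Putting the pieces together, a query might be answered as in the sketch below: hash the query with the same hyperplanes, then run the exact cosine comparison only on the documents in that bucket. All function names here are hypothetical, and returning `None` for an empty bucket is just one possible policy.

```python
import math
from collections import defaultdict

def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    bits = 0
    for normal in planes:
        dot = sum(n * v for n, v in zip(normal, vec))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits

def build_buckets(docs, planes):
    """Map each signature to the ids of the documents that share it."""
    buckets = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        buckets[signature(vec, planes)].append(doc_id)
    return buckets

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def query_nearest(query, docs, planes, buckets):
    """Search only the query's bucket; return (doc_id, similarity) or None if it is empty."""
    candidates = buckets.get(signature(query, planes), [])
    if not candidates:
        return None
    best = max(candidates, key=lambda doc_id: cosine(query, docs[doc_id]))
    return best, cosine(query, docs[best])
```

The result is approximate: the true nearest neighbor is found only if it happens to share a bucket with the query.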

gsamaras
  • Thanks!! I got the idea completely. I am just wondering whether false negatives are still an issue with this approach. Any suggestions on how to implement random-hyperplane-based buckets in C/C++? – viz12 Aug 10 '17 at 15:24
  • @viz12 that's a new question. I advise you to accept my answer, think about it a bit more on your own, and if needed, post a new question! =) – gsamaras Aug 10 '17 at 15:29
  • Sure, I will think about that. – viz12 Aug 10 '17 at 16:00
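On the false-negative question raised in the comments: a close neighbor can end up on the other side of a hyperplane and be missed. Assuming hyperplanes through the origin with uniformly random directions, one hyperplane separates two vectors at angle θ with probability θ/π, so the chance of a miss can be estimated. The sketch below also assumes the usual remedy of building several independent sets of buckets ("tables") and searching all of them; the function names are made up for illustration.

```python
import math

def collision_probability(theta, num_planes):
    """Probability that two vectors at angle theta (radians) share a bucket in one table."""
    return (1.0 - theta / math.pi) ** num_planes

def recall_with_tables(theta, num_planes, num_tables):
    """Probability that the two vectors share a bucket in at least one of the tables."""
    p = collision_probability(theta, num_planes)
    return 1.0 - (1.0 - p) ** num_tables
```

More hyperplanes per table shrink the buckets but lower the per-table hit rate; more tables raise the hit rate at the cost of memory and query time.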