
I have about 100M numeric vectors (MinHash fingerprints); each vector contains 100 integers between 0 and 65536. I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1, 0, 30, 9, 42, ...]), find the ratio of intersection to union of this query set against each of the 100M sets in the database.

The requirement is to return the k "nearest neighbors" of the query vector in under 1 second (not including indexing/file-IO time) on a laptop. So some kind of indexing is clearly required, and the question is what the most efficient approach would be.

Note: I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets in order to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.

I've tried a simple locality-sensitive hashing technique as described in chapter 3 of Jeffrey Ullman's Mining of Massive Datasets: dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] -> "124523") and using these strings as keys in a hash table, where each key holds a bucket of candidate neighbors (see the sketch below). The problem is that some of these snippets produce far too many candidates, and changing the number of bands doesn't help.
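For concreteness, here is a minimal Python sketch of that banding scheme (the band/snippet sizes match the description above; the key separator is my own addition, since joining digits without one, as in the example, lets different snippets collide):

```python
from collections import defaultdict

NUM_BANDS = 20   # 20 bands of 5 values each over a 100-value fingerprint
BAND_SIZE = 5

def band_keys(vec):
    # Yield one hash-table key per band; a separator prevents collisions
    # such as [1, 24, 5] and [12, 4, 5] mapping to the same string.
    for b in range(NUM_BANDS):
        snippet = vec[b * BAND_SIZE:(b + 1) * BAND_SIZE]
        yield (b, ",".join(map(str, snippet)))

def build_index(vectors):
    # vectors: iterable of 100-int MinHash fingerprints
    index = defaultdict(set)
    for vec_id, vec in enumerate(vectors):
        for key in band_keys(vec):
            index[key].add(vec_id)
    return index

def candidates(index, query):
    # Union of all buckets the query's bands fall into; these still
    # need to be ranked by exact Jaccard similarity.
    cands = set()
    for key in band_keys(query):
        cands |= index.get(key, set())
    return cands
```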

– alex

4 Answers


I might be a bit late, but I would suggest IVFADC indexing by Jégou et al.: Product Quantization for Nearest Neighbor Search.

It works for L2 distance / dot-product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.

It is also implemented in the FAISS library for similarity search, so you could take a look at that.
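For illustration, a minimal FAISS sketch of an IVFADC-style index (IndexIVFPQ); the nlist, m, and nprobe values are illustrative, and since FAISS indexes L2/inner-product spaces rather than Jaccard, you'd re-rank the returned candidates by exact Jaccard afterwards:

```python
import numpy as np
import faiss

d = 100          # dimensionality: one MinHash value per component
nlist = 1024     # number of coarse IVF clusters (illustrative)
m = 20           # PQ sub-quantizers; must divide d

# Stand-in data: in practice, load the 100M fingerprints instead.
xb = np.random.randint(0, 65536, (100_000, d)).astype('float32')

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-code
index.train(xb)   # learns the coarse centroids and PQ codebooks
index.add(xb)

index.nprobe = 16                      # clusters scanned per query
xq = xb[:5]                            # example queries
distances, ids = index.search(xq, 10)  # approximate 10-NN under L2
```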

– fzk

One way to go about this is the following:

(1) Arrange the vectors into a tree (a radix tree).

(2) Query the tree with a fuzzy criterion; in other words, a match is one where the difference in values at each node of the tree is within a threshold.

(3) From (2), generate a subtree that contains all the matching vectors.

(4) Now repeat step (2) on the subtree with a smaller threshold.

Continue until the subtree has about K items. If it ends up with fewer than K, take the previous subtree, compute the Jaccard distance to each of its members, and sort to eliminate the worst matches until only K items remain. (A rough sketch follows.)
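A rough sketch of this procedure in Python, assuming a plain trie over vector positions rather than a compressed radix tree, and re-querying the tree with a shrinking threshold (the threshold schedule is my own illustrative choice, not part of the answer):

```python
def build_trie(vectors):
    # Plain trie keyed by successive vector values; a real radix tree
    # would compress single-child chains. Built once, queried many times.
    root = {}
    for vid, vec in enumerate(vectors):
        node = root
        for v in vec:
            node = node.setdefault(v, {})
        node.setdefault('_ids', []).append(vid)
    return root

def fuzzy_match(node, query, depth, threshold, out):
    # Collect ids of vectors whose value at every position differs from
    # the query by at most `threshold` (step 2 above).
    if depth == len(query):
        out.extend(node.get('_ids', []))
        return
    for val, child in node.items():
        if val != '_ids' and abs(val - query[depth]) <= threshold:
            fuzzy_match(child, query, depth + 1, threshold, out)

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def knn(trie, vectors, query, k, threshold=1024):
    best = None
    while threshold >= 1:
        cands = []
        fuzzy_match(trie, query, 0, threshold, cands)
        if len(cands) < k:
            break                 # too strict: keep the previous candidate set
        best = cands
        threshold //= 2           # step (4): tighten and repeat
    if best is None:              # even the loosest threshold was too strict
        best = list(range(len(vectors)))
    # Rank the surviving candidates by exact Jaccard and keep the top k.
    return sorted(best, key=lambda vid: -jaccard(vectors[vid], query))[:k]
```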

– Tyler Durden

Answering my own question after 6 years: there is now a benchmark for approximate nearest neighbor search with many algorithms that solve this problem: https://github.com/erikbern/ann-benchmarks. The current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
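For example, a minimal hnswlib sketch (parameters are illustrative; hnswlib ships only 'l2', 'ip', and 'cosine' spaces, so for MinHash data you'd re-rank its candidates by exact Jaccard):

```python
import numpy as np
import hnswlib

d = 100
# Stand-in data: in practice, load the real fingerprints instead.
data = np.random.randint(0, 65536, (100_000, d)).astype('float32')

index = hnswlib.Index(space='l2', dim=d)   # built-in spaces: l2, ip, cosine
index.init_index(max_elements=len(data), ef_construction=200, M=16)
index.add_items(data, np.arange(len(data)))

index.set_ef(100)   # query-time accuracy/speed trade-off (ef >= k)
labels, distances = index.knn_query(data[:5], k=10)
```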

– alex
  • http://ann-benchmarks.com/ is a more direct link. Note though that they don't test binary similarity search (like MinHash), and that small-world graphs have their own issues (they take quadratic time to build and don't work on hard data sets) – Thomas Ahle Nov 15 '19 at 10:57
  • Thanks, by "binary similarity search" I guess you mean similarity search via the Jaccard set-similarity coefficient as in MinHash (as opposed to Euclidean, cosine, or Hamming distance metrics) – alex Nov 15 '19 at 20:47
  • Jaccard as an example, but also other similarity measures on binary data: https://arxiv.org/pdf/1612.07710 – Thomas Ahle Nov 15 '19 at 22:09

You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

– Ron