How to find closest embedding vectors?

Question

I have 100K known embedding i.e.

[emb_1, emb_2, ..., emb_100000]

Each of this embedding is derived from GPT-3 sentence embedding with dimension 2048.

My task is given an embedding(embedding_new) find the closest 10 embedding from the above 100k embedding.

The way I am approaching this problem is brute force.

Every time a query asks to find the closest embeddings, I compare embedding_new with [emb_1, emb_2, ..., emb_100000] and get the similarity score.

Then I do quicksort of the similarity score to get the top 10 closest embedding.

Alternatively, I have also thought about using Faiss.

Is there a better way to achieve this?

what do you use for similarity? what is the word dimension? if each similarity is O(d), then your algorithm already runs in O(d n log n). you can reduce the log n factor using heapselect but anything further may or may not be possible with k-d trees, depending on distribution and similarity measure. by curse of dimensionality, you likely need much more than 2^d points for a k-d tree to be successful — qwr, Jul 24 '22 at 03:49

elkbrs · Answer 1 · 2022-07-26T06:51:11.617

Using you own idea, just make sure that the embeddings are in a matrix form, you can easily use numpy for this. This is computed in linear time (in num. of embeddings) and should be fast.

import numpy as np
k = 10 # k best embeddings
emb_mat = np.stack([emb_1, emb_2, ..., emb_100000])
scores = np.dot(emb_mat, embedding_new)
best_k_ind = np.argpartition(scores, k)[-k:] 
top_k_emb = emb_mat[best_k_ind]

The 10 best embeddings will be found in top_k_emb. For a general solution inside a software project you might consider Faiss by Facebook Research. An example for using Faiss:

d = 2048  # dimensionality of your embedding data
k = 10  # number of nearest neighbors to return
index = faiss.IndexFlatIP(d)
emb_list = [emb_1, emb_2, ..., emb_100000]
index.add(emb_list)
D, I = index.search(embedding_new, k)

You can use IndexFlatIP for inner product similarity, or indexFlatL2 for Euclidian\L2-norm distance. In order to bypass memory issues (data>1M) refer to this great infographic Faiss cheat sheet at slide num. 7

score 0 · Accepted Answer · answered Jul 24 '22 at 23:14

I found a solution using Vector Database Lite (VDBLITE)

VDBLITE here: https://pypi.org/project/vdblite/

import vdblite
from time import time
from uuid import uuid4
import sys
from pprint import pprint as pp


if __name__ == '__main__':
    vdb = vdblite.Vdb()
    dimension = 12    # dimensions of each vector                         
    n = 200    # number of vectors                   
    np.random.seed(1)             
    db_vectors = np.random.random((n, dimension)).astype('float32')
    print(db_vectors[0])
    for vector in db_vectors:
        info = {'vector': vector, 'time': time(), 'uuid': str(uuid4())}
        vdb.add(info)
    vdb.details()
    results = vdb.search(db_vectors[10])
    pp(results)

Looks like it uses FAISS behind the scene.

How to find closest embedding vectors?

2 Answers2