I want to create an index of nearly 10M vectors of size 1024. Here is the code that I used:

import numpy as np
import faiss
import random

f = 1024

vectors = []
no_of_vectors = 10000000
for k in range(no_of_vectors):
    v = [random.gauss(0, 1) for z in range(f)]
    vectors.append(v)

np_vectors = np.array(vectors).astype('float32')

index = faiss.IndexFlatL2(f)
index.add(np_vectors)

faiss.write_index(index, "faiss_index.index")

The code worked for a small number of vectors, but I exceed the memory limit once the number of vectors reaches about 2M. I also tried calling index.add() incrementally instead of appending the vectors to a list (vectors = []), but that didn't help either.

I want to know how to create an index for a large number of vectors.
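A rough back-of-the-envelope estimate (assuming float32 storage in the index and ~24 bytes per CPython float object in the intermediate list) suggests why this setup runs out of memory:

```python
# Rough memory estimate for the setup above.
n = 10_000_000        # vectors
d = 1024              # dimension

flat_bytes = n * d * 4            # float32 copy held by IndexFlatL2
flat_gib = flat_bytes / 1024**3   # just the raw index data

# The intermediate list-of-lists is far worse: a CPython float object
# is roughly 24 bytes, before any list overhead.
list_gib = n * d * 24 / 1024**3

print(round(flat_gib, 1), round(list_gib))  # → 38.1 229
```

So even without the Python list, a flat index of this size needs roughly 38 GiB of RAM; building the full list first needs several times more.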

Janaka
  • Why do you need 10M vectors of size 1024?! This is incredibly HUGE!!?! – Gugu72 Jan 20 '21 at 17:20
  • I am creating a document similarity checking tool. It consists of a large database of documents. I need to add laser embeddings of all sentences into the index. There are about 10M. – Janaka Jan 20 '21 at 17:28
  • Hmm, maybe use numpy or pandas? – Gugu72 Jan 20 '21 at 17:40

1 Answer


If you want to continue using Faiss, the reference below discusses how to choose a different index type; at this scale, HNSW or IVF-PQ are the usual candidates, since a flat index stores every vector uncompressed in RAM.

ref: https://wangzwhu.github.io/home/file/acmmm-t-part3-ann.pdf (see the last page)


Another option is to try a distributed solution such as Milvus, which is built on top of ANN libraries like Faiss.

Ji Bin