I'm new to Faiss. My task is to find similar vectors using the inner product. Because of the limited RAM on my laptop, I'm trying to add new vectors to a trained index I created earlier.
Situation: I already have a trained and tuned index, and I want to add new vectors to it in batches. This is the code I use:
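import faiss
import scipy.sparse
from tqdm import tqdm

BATCH_SIZE = 100_000

# Add each batch to a fresh copy of the trained index and save it as its own block.
for idx in tqdm(range(int((len(df_tr) - N) / BATCH_SIZE))):
    index = faiss.read_index("indexes/trained_block.index")
    X = scipy.sparse.csr_matrix.toarray(X_full[idx * BATCH_SIZE:(idx + 1) * BATCH_SIZE]).astype('float32')
    faiss.normalize_L2(X)  # normalize rows in place to unit length
    index.add(X)
    faiss.write_index(index, "indexes/block_{}.index".format(idx))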
# Collect the inverted lists of every block index.
ivfs = []
for idx in tqdm(range(int((len(df_tr) - N) / BATCH_SIZE))):
    index = faiss.read_index("indexes/block_{}.index".format(idx))
    ivfs.append(index.invlists)
    # Keep the inverted lists from being deallocated together with the index.
    index.own_invlists = False

# Merge all inverted lists into one on-disk structure and attach it to the trained index.
index = faiss.read_index("indexes/trained_block.index")
invlists = faiss.OnDiskInvertedLists(index.nlist, index.code_size, "indexes/merged_index.ivfdata")
ivf_vector = faiss.InvertedListsPtrVector()
for ivf in ivfs:
    ivf_vector.push_back(ivf)
ntotal = invlists.merge_from(ivf_vector.data(), ivf_vector.size())
index.ntotal = ntotal
index.replace_invlists(invlists)
faiss.write_index(index, "indexes/merged_index.index")
But when I search for similar vectors, I get labels only in the range 0 to 100,000, which is the size of a single batch:
query = scipy.sparse.csr_matrix.toarray(vectorizer.transform(['sample'])).astype('float32')
index.nprobe = 10
D, I = index.search(query, 100)
print(I)
>! [[93121 75215 99842 17907 17835 94646 93832 95062 87345 91036 87749 88507
>! 86637 84382 82840 17261 84315 93969 78607 94330 99566 49088 95428 85836
>! 77877 54978 91496 55231 75761 21885 64547 78052 81165 8370 81296 92231
>! 67480 78757 16133 56417 43638 25109 77122 43178 53848 65869 49360 8440
>! 3287 88457 21400 28398 15780 94845 35407 92137 55795 98621 13516 53323
>! 23751 50605 62996 13813 59634 31121 86262 5930 39545 79405 91105 15471
>! 23820 66360 46133 29015 28760 25257 15921 1079 47869 53775 26922 40162
>! 79801 86765 82793 29220 53651 21723 11123 83319 47878 93225 2211 44512
>! 65712 41331 83744 95585]]
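
A possible explanation, not verified against this data: every block starts from the same trained index, so `index.add(X)` may number each batch from 0 again, and the merged index would then contain only ids below `BATCH_SIZE`. Below is a minimal sketch of giving each batch its own global id range. `batch_id_ranges` is a hypothetical helper, the toy sizes are made up, and the commented `add_with_ids` call assumes the index type accepts explicit ids.

```python
# Sketch only: compute a distinct global id range for every batch.
# `batch_id_ranges` is a hypothetical helper, not part of Faiss.

def batch_id_ranges(n_vectors, batch_size):
    """Yield (batch_no, start, stop) for each full batch of `batch_size` ids."""
    for batch_no in range(n_vectors // batch_size):
        start = batch_no * batch_size
        yield batch_no, start, start + batch_size


# Toy numbers standing in for len(df_tr) - N and BATCH_SIZE.
ranges = list(batch_id_ranges(n_vectors=350_000, batch_size=100_000))
for batch_no, start, stop in ranges:
    # In the real loop, the ids would be passed explicitly instead of index.add(X):
    #   ids = np.arange(start, stop, dtype="int64")
    #   index.add_with_ids(X, ids)
    print(batch_no, start, stop)
```

Like the original loop, this sketch only covers full batches, so a trailing partial batch is skipped.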
Has anybody worked with Faiss and has an idea how to fix this?
Thanks in advance.