How can I check for duplicate documents in my vectorstore when adding documents?
Currently I am doing something like:
vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)
# get() returns the stored page_content strings, so compare against doc.page_content
existing_texts = vectorstore.get()["documents"]
final_docs = [doc for doc in final_docs if doc.page_content not in existing_texts]
# the store already holds the embedding function, so no embedding kwarg is needed here
vectorstore.add_documents(documents=final_docs)
However, I am wondering about the performance of this approach on large datasets, since it pulls every stored document and does a linear membership check for each new one.
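One alternative I have been considering, to avoid loading the whole collection: derive a deterministic ID from each document's content (e.g. a SHA-256 of page_content) and ask Chroma only for those IDs. This is a minimal sketch, assuming final_docs is a list of LangChain Document objects and the same vectorstore as above; hash_id is just a helper name I made up.

import hashlib

def hash_id(doc):
    # deterministic ID derived from the document text
    return hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()

# dedupe within the new batch itself, keeping one Document per content hash
unique_docs = {hash_id(doc): doc for doc in final_docs}

# look up only these IDs instead of fetching every stored document
existing_ids = set(vectorstore.get(ids=list(unique_docs))["ids"])

to_add = {doc_id: doc for doc_id, doc in unique_docs.items() if doc_id not in existing_ids}
if to_add:
    vectorstore.add_documents(documents=list(to_add.values()), ids=list(to_add.keys()))

With this, the membership check scales with the size of each new batch rather than the size of the collection. I am not certain how Chroma behaves when add_documents reuses an existing ID (reject vs. upsert), which is why the pre-filter against existing_ids is still there.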
Additionally, will duplicate documents cause issues in practice? From my understanding, identical texts embed to the same vector, so the only overhead seems to be non-functional (i.e. latency).