
How can I check for duplicate documents in my vectorstore when adding documents?

Currently I am doing something like:

vectorstore = Chroma(
    persist_directory=persist_dir,
    embedding_function=embeddings,
)

documents = vectorstore.get()["documents"]
final_docs = list(filter(lambda x: x not in documents, final_docs))
vectorstore.add_documents(documents=final_docs, embedding=embeddings)

However, I am wondering about the performance on large datasets, since each membership check scans the whole list.
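One cheap improvement is to build a set of the existing texts once, so each membership check is O(1) on average instead of an O(n) list scan. A minimal sketch with stand-in data (in the real code, `stored_texts` would come from `vectorstore.get()["documents"]`):

```python
# Stand-in for vectorstore.get()["documents"]: the texts already stored.
stored_texts = ["doc a", "doc b"]

# Build the set once; subsequent lookups are O(1) on average.
existing = set(stored_texts)

new_texts = ["doc b", "doc c"]
to_add = [t for t in new_texts if t not in existing]
# to_add == ["doc c"]
```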

Additionally, will duplicate documents cause issues in practice? From my understanding, they will embed to the same vector, so the only overhead seems to be non-functional (i.e., latency).

information_interchange

1 Answer


You can compute a hash for each document and add it to a Python set(), which ensures there are no duplicate documents:

    import hashlib

    # Set to store hashes of documents seen so far
    document_hashes = set()

    def add_document(doc):
        # Compute the SHA-256 hash of the document text
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()

        # Skip the document if its hash has been seen before
        if doc_hash in document_hashes:
            print("Duplicate document detected!")
            return False
        else:
            document_hashes.add(doc_hash)
            # Your code for adding the document to the vector store
            print("Document added successfully!")
            return True

    # Sample documents (doc3 duplicates doc1)
    doc1 = "This is document 1"
    doc2 = "This is document 2"
    doc3 = "This is document 1"

    add_document(doc1)
    add_document(doc2)
    add_document(doc3)
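A related idea (a sketch, not built-in LangChain behavior) is to turn the hash into a deterministic document ID and pass it via the `ids` parameter of `add_documents`. Identical content then always maps to the same ID, so the store itself can refuse or upsert the duplicate; exactly how a colliding ID is handled (skip vs. overwrite) depends on the Chroma client version, so verify against your setup.

```python
import hashlib

def content_id(text: str) -> str:
    # Deterministic ID derived from the document text: identical
    # content always produces the same ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

docs = ["This is document 1", "This is document 2"]
ids = [content_id(t) for t in docs]
# These would be passed as e.g.
# vectorstore.add_documents(documents=docs, ids=ids)
# so a re-added duplicate hits its existing ID instead of creating a new row.
```

Unlike the in-memory set above, this survives restarts, since the dedup key lives in the store itself.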
ZKS
  • Interesting, I am wondering if Chroma or some other dedicated vector DB already has this functionality built in? – information_interchange Aug 23 '23 at 19:30
  • You can write sample code against those vector DBs and check; if the DB does not have this functionality, then you need to write custom logic to achieve it. – ZKS Aug 24 '23 at 04:36