I am currently learning ChromaDB vector DB.

I can't understand how the querying process works.

When I try to query using text, it's returning all documents.

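import chromadb

# Setup assumed by the snippet; the collection name is a placeholder.
client = chromadb.Client()
collection = client.create_collection(name="my_collection")

collection.add(
    documents=["This is a document about cat", "This is a document about car"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
    ids=["id1", "id2"]
)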

results = collection.query(
    query_texts=["vehicle"],
    n_results=2
)

results

The output is:

{'ids': [['id2', 'id1']],
 'distances': [[0.8069301247596741, 1.648103952407837]],
 'metadatas': [[{'category': 'vehicle'}, {'category': 'animal'}]],
 'embeddings': None,
 'documents': [['This is a document about car',
   'This is a document about cat']]}

Even when I enter a word that is not present anywhere, it still returns all the docs.

Why does this happen?

RagAnt
  • Did you mean to use `where={'category': 'vehicle'}`? A simple query like what you did is always going to return the whole collection, and the `'distances'` tells you how close the document was to your query text. `query_texts` doesn't look at the metadata. – Tim Roberts Jul 23 '23 at 18:25
  • @TimRoberts Okay. And how do I do a text similarity search inside the documents? For example, "What are the documents about car?" should return only "This is a document about car". – RagAnt Jul 23 '23 at 18:38
  • No, it returns ALL the documents, but it tells you how likely it is that each document is about a car. Actually, it only returns the top `n_results` results. – Tim Roberts Jul 23 '23 at 18:43
  • @TimRoberts So the lower the distance, the better the match. Right? – RagAnt Jul 23 '23 at 19:00
  • I have no idea. The documentation doesn't say, and the authors apparently felt their code needed no comments. – Tim Roberts Jul 23 '23 at 22:40

1 Answer

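ChromaDB performs a similarity search on the embeddings stored as vectors. It does not look for the literal word "vehicle" in your documents. Instead, the query text and every document are converted to vectors that capture their meaning, and the documents are ranked by how close their vectors are to the query vector. That is why a query always returns the `n_results` nearest documents, even when the query word appears in none of them, and why a lower distance means a closer match. As far as I know, the default distance is squared L2 unless the collection is created with `hnsw:space` set to `cosine`; for normalized embeddings both give the same ranking. You can read more about how cosine similarity works here - https://www.geeksforgeeks.org/cosine-similarity/#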
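To make the ranking idea concrete, here is a minimal sketch in plain Python. It is not ChromaDB's implementation: the 3-dimensional vectors are made up to stand in for real embeddings, and `cosine_distance` and `rank` are hypothetical helper names. It only illustrates that every document gets a distance and the closest `n_results` come back, whether or not any of them is a good match.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up 3-d "embeddings" (assumption); real ones come from the embedding model.
docs = {
    "id1": [0.9, 0.1, 0.2],  # stands in for the cat document
    "id2": [0.1, 0.9, 0.3],  # stands in for the car document
}
query = [0.2, 0.8, 0.4]      # stands in for the query "vehicle"

def rank(query, docs, n_results):
    """Score every document, sort by ascending distance, keep the top n_results."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]

ranked = rank(query, docs, n_results=2)
for doc_id, distance in ranked:
    print(doc_id, round(distance, 3))
```

With `n_results=2` and only two documents, both are returned; the ordering and the distances are what carry the information.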

As for the embeddings, by default they are generated using the all-MiniLM-L6-v2 model. You can read more about it in their documentation - https://docs.trychroma.com/embeddings
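If you want only the car document back, one option is the metadata filter `where={'category': 'vehicle'}` suggested in the comments. Another is to drop weak matches yourself after the query. The sketch below does the latter on the exact output from the question; `filter_by_distance` is a hypothetical helper, and the cutoff of 1.0 is an arbitrary assumption that you would need to tune for your own data and distance metric.

```python
def filter_by_distance(results, max_distance):
    """Keep, for each query, only the hits whose distance is <= max_distance."""
    kept = []
    for ids, dists, docs in zip(
        results["ids"], results["distances"], results["documents"]
    ):
        kept.append(
            [(i, d, doc) for i, d, doc in zip(ids, dists, docs) if d <= max_distance]
        )
    return kept

# The query output shown in the question.
results = {
    "ids": [["id2", "id1"]],
    "distances": [[0.8069301247596741, 1.648103952407837]],
    "metadatas": [[{"category": "vehicle"}, {"category": "animal"}]],
    "embeddings": None,
    "documents": [["This is a document about car", "This is a document about cat"]],
}

# 1.0 is an arbitrary cutoff chosen for illustration (assumption).
close_hits = filter_by_distance(results, max_distance=1.0)
print(close_hits)
```

The outer list has one entry per query text, mirroring the nested lists that `collection.query` returns.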

harshiniv