3

I have been working with LangChain's Chroma vector store (vectordb). It has two methods for running a similarity search with scores.

  1. vectordb.similarity_search_with_score()
  2. vectordb.similarity_search_with_relevance_scores()

According to the documentation, the first one should return a cosine distance as a float, where a smaller value means a closer match.

And the second one should return a score from 0 to 1, where 0 means dissimilar and 1 means most similar.

But when I tried them, both methods returned exactly the same results with the same scores, and those scores exceed the upper limit of 1, which should not happen for the second function.

What's going on here?

botaskay

3 Answers

2

I have experienced this issue as follows:

vectordb.similarity_search() and vectordb.similarity_search_with_score() return exactly the same top n chunks in the same order; similarity_search_with_score() additionally returns a score for each chunk. I think this score is important for filtering out irrelevant chunks.
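As a rough illustration of that filtering idea, here is a minimal sketch. It assumes the results arrive as a list of (document, score) pairs where a lower score means a closer match; plain strings stand in for the documents. The helper name filter_by_max_distance and the 0.5 cutoff are hypothetical, and a sensible cutoff depends on the distance metric and the embedding model.

```python
def filter_by_max_distance(results, max_distance):
    """Keep only (document, score) pairs whose distance is at most max_distance."""
    return [(doc, score) for doc, score in results if score <= max_distance]


# Stand-in for the output of a scored search: (document, distance) pairs.
results = [
    ("chunk about cats", 0.12),
    ("chunk about dogs", 0.35),
    ("unrelated chunk", 1.40),
]

# The 0.5 threshold is an arbitrary value chosen for this example only.
kept = filter_by_max_distance(results, max_distance=0.5)
print(kept)
```

If the scores are relevance scores instead (higher is better), the comparison would need to be flipped.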

On the other hand, I have read that vectordb.similarity_search_with_relevance_scores() is more sophisticated and requires more processing to calculate the similarity score, but in dozens of comparisons I got exactly the same results, in nearly the same time, as with vectordb.similarity_search_with_score().

Another issue that caught my attention is the meaning of the scores returned by both methods. The official documentation states that the smaller the score, the higher the similarity. I also read that the score ranges from 0 to 1.

In my tests, I got scores outside that range. For example, some unrelated results scored 1.9, 2.03 and 0.03 ...

In my experience, scores between 0.8 and 1.2 indicate higher similarity.

msklc
  • 1
    During my testing with LangChain + Redis, I found that scores such as 0.05, 0.07 and 0.09 indicate higher similarity, while unrelated results scored 0.22 or 0.23. – will Aug 18 '23 at 02:49
1

The official documentation refers to cosine distance, not cosine similarity.

Cosine Similarity: Measures the cosine of the angle between vectors, indicating their similarity. Higher values mean greater similarity.

Cosine Distance: Measures the dissimilarity between vectors as the complement of the cosine similarity. Higher values mean greater dissimilarity.

cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
cosine_distance(A, B) = 1 - cosine_similarity(A, B)
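
The two formulas above can be sketched in plain Python as follows. This is only an illustration of the definitions, not the code any particular vector store runs; the example vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """(A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 - cosine_similarity, so smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Because cosine similarity ranges from -1 to 1, cosine distance ranges from 0 to 2, so a distance above 1 is possible by definition.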
toyota Supra
0

If you are using Chroma, you should set the distance metric when creating a collection: https://docs.trychroma.com/usage-guide#changing-the-distance-function

The default distance is l2. That is why similarity_search_with_relevance_scores() used to give me scores like 3626.016357421875. After changing the metric to cosine, the scores are now in (0, 1], with scores closer to 1 indicating higher similarity.

Chroma.from_documents(documents=documents, embedding=cohere, collection_metadata={"hnsw:space": "cosine"})
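
To see why the metric matters so much, here is a small self-contained sketch comparing the two distances on the same pair of vectors. It assumes that the l2 space is a squared Euclidean distance, which is how the Chroma docs describe it as far as I know; the vectors are made up and the function names are only for illustration.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: unbounded, grows with vector magnitude."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [30.0, 40.0]
doc = [60.0, 80.0]  # same direction as the query, twice the length

l2_score = squared_l2(query, doc)            # large, driven by magnitude
cosine_score = cosine_distance(query, doc)   # close to zero, same direction

print(l2_score, cosine_score)
```

With unnormalized embeddings the l2 value can be in the thousands even for closely related vectors, which would match the kind of score I was seeing before switching to cosine.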
ankush1377