I have a table of movies in Cassandra (hosted on Astra DB), with a lone primary key of movie_id. There are several columns, but for my vector search I really only care about the title. The movie_vector column has a storage attached index (SAI) on it, which was created with the following CQL:

CREATE CUSTOM INDEX ON movieapp.movies (movie_vector) USING 'StorageAttachedIndex';

When I execute a CQL vector search based on the vector defined for "Star Wars," I get these results:

SELECT title, movie_vector FROM movies
ORDER BY movie_vector ANN OF [37, 4, 8, 13, 42.1497, 8.1, 6778]
LIMIT 6;

 title                   | movie_vector
-------------------------+-------------------------------------
               Star Wars |  [37, 4, 8, 13, 42.1497, 8.1, 6778]
 The Empire Strikes Back | [37, 4, 8, 13, 19.47096, 8.2, 5998]
      Return of the Jedi | [37, 4, 8, 13, 14.58609, 7.9, 4763]
           The Lion King |    [49, 1, 3, 7, 21.60576, 8, 5520]
              Pocahontas |  [10, 1, 3, 4, 13.28007, 6.7, 1509]
                  Batman |    [18, 5, 8, 0, 19.10673, 7, 2145]

(6 rows)

How are these results sorted? Is there some way to see the logic behind that?

Aaron
    The default similarity function for the index is `cosine`. If your index was created with `dot_product` or `euclidean` instead, use the `similarity_dot_product` or `similarity_euclidean` function accordingly. – Madhavan Aug 23 '23 at 14:58

1 Answer

Given the defaults and the index shown above, the results of a CQL vector search are sorted by the cosine similarity between each row's vector and the query vector. You can see this with the CQL similarity_cosine function, which takes two arguments: a column of type Vector<float, n> and the query vector.

For the above query, it would work like this:

SELECT title,
    similarity_cosine(movie_vector, [37, 4, 8, 13, 42.1497, 8.1, 6778]) AS similarity,
    movie_vector
FROM movies
ORDER BY movie_vector ANN OF [37, 4, 8, 13, 42.1497, 8.1, 6778]
LIMIT 6;

 title                   | similarity | movie_vector
-------------------------+------------+-------------------------------------
               Star Wars |          1 |  [37, 4, 8, 13, 42.1497, 8.1, 6778]
 The Empire Strikes Back |   0.999998 | [37, 4, 8, 13, 19.47096, 8.2, 5998]
      Return of the Jedi |   0.999996 | [37, 4, 8, 13, 14.58609, 7.9, 4763]
           The Lion King |   0.999995 |    [49, 1, 3, 7, 21.60576, 8, 5520]
              Pocahontas |   0.999995 |  [10, 1, 3, 4, 13.28007, 6.7, 1509]
                  Batman |   0.999992 |    [18, 5, 8, 0, 19.10673, 7, 2145]

(6 rows)

As shown above, the vector for the movie "Star Wars" has a similarity of 1, which is an exact match. This makes sense, as that vector ([37, 4, 8, 13, 42.1497, 8.1, 6778]) was the one used in the query.

The remaining rows are ordered by their similarity_cosine score, which measures how closely the direction of each row's movie_vector matches the direction of the query vector. Rows with the highest scores appear at the top of the result set, and the scores decrease toward the bottom.

It's a bit verbose, but still a useful way to show how vector search results are sorted.
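
To sanity-check the ordering outside of Cassandra, here is a minimal Python sketch that recomputes the scores from the vectors in the result set. The helper names (cosine, scaled_similarity) are made up for illustration. The (1 + cos) / 2 scaling is an assumption on my part: it is not stated anywhere above, but it appears consistent with the similarity column in the output.

```python
import math

def cosine(a, b):
    """Raw cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def scaled_similarity(a, b):
    """ASSUMPTION: map the raw cosine from [-1, 1] into [0, 1] as (1 + cos) / 2."""
    return (1 + cosine(a, b)) / 2

# The query vector, i.e. the vector stored for "Star Wars".
query = [37, 4, 8, 13, 42.1497, 8.1, 6778]

# Vectors copied from the result set above.
movies = {
    "Star Wars": [37, 4, 8, 13, 42.1497, 8.1, 6778],
    "The Empire Strikes Back": [37, 4, 8, 13, 19.47096, 8.2, 5998],
    "Return of the Jedi": [37, 4, 8, 13, 14.58609, 7.9, 4763],
    "The Lion King": [49, 1, 3, 7, 21.60576, 8, 5520],
    "Pocahontas": [10, 1, 3, 4, 13.28007, 6.7, 1509],
    "Batman": [18, 5, 8, 0, 19.10673, 7, 2145],
}

# Score every movie against the query, then sort from most to least similar.
scores = {title: scaled_similarity(vec, query) for title, vec in movies.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(f"{title:>25} | {scores[title]:.6f}")
```

Note that every score is within about 0.00001 of 1. The last element of each vector is in the thousands and dwarfs the others, so all of these vectors point in nearly the same direction, and the ordering is decided by very small differences.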

Aaron
  • It is similarity_cosine because the default SAI index on vectors is a cosine index. Results are ordered according to how the index was created, so if it was created as a euclidean index, they will be ordered based on similarity_euclidean. – Andrew Aug 23 '23 at 23:41
  • Good call, @Andrew. You're right in that I was assuming the defaults. Edit made. – Aaron Aug 24 '23 at 02:16