
I have the following schema:

schema embeddings {
  document embeddings {
    field id type int {}
    field text_embedding type tensor<double>(d0[960]) {
      indexing: attribute | index
      attribute {
        distance-metric: euclidean
      }
    }
  }

  rank-profile distance {
    num-threads-per-search:1
    inputs {
      query(query_embedding) tensor<double>(d0[960])
    }
    first-phase {
      expression: distance(field, text_embedding)
    }
  }
}

and the following query body:

body = {
    'yql': 'select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));',
    "hits":10,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'distance',
    },
}

The problem is that this query returns different results depending on the targetHits parameter. For example, the top-1 distance is 2.847000 with targetHits: 10 and 3.028079 with targetHits: 200.

Moreover, if I run the same query using the Vespa CLI:

vespa query -t http://query "select * from embeddings where ([{\"targetHits\":10}] nearestNeighbor(text_embedding, query_embedding));" \
   "approximate=false" \
   "ranking.profile=distance" \
   "ranking.features.query(query_embedding)=[...]"

I get a third, different result:

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 10
        },
        "coverage": {
            "coverage": 100,
            "documents": 1000000,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:embeddings:embeddings::926288",
                "relevance": 0.8158006540357854,
    ...

where, as we can see, the top-1 distance is 0.8158.

So, how can I perform an exact (not approximate) nearest neighbor search whose results do not depend on any parameters?

eawer

1 Answer


Vespa sorts results by descending relevance score. When you use the distance rank-feature instead of closeness as the relevance score (your first-phase ranking expression), you end up inverting the order, so that more distant (worse) neighbors are ranked higher. As you increase targetHits you get even worse neighbors.
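To illustrate the effect described above, here is a minimal sketch in plain Python (not Vespa code). It assumes the exact search first keeps the `k` nearest candidates and then orders them by descending score; the random distances and the helper names are made up for the illustration.

```python
import random

def exact_top_k(distances, k):
    """Return the k smallest distances, as an exact nearest-neighbor scan would."""
    return sorted(distances)[:k]

def top1_ranked_by_distance(distances, k):
    """Keep the k nearest hits, then sort them by descending distance (the score)."""
    hits = exact_top_k(distances, k)
    return sorted(hits, reverse=True)[0]

random.seed(0)
# Hypothetical distances from one query vector to 1000 documents.
distances = [random.uniform(0.5, 5.0) for _ in range(1000)]

true_nearest = min(distances)
for k in (10, 200):
    # The reported top-1 is the k-th nearest neighbor, so it grows with k.
    print(k, top1_ranked_by_distance(distances, k), true_nearest)
```

With distance as the score, the hit shown first is the farthest of the `k` retrieved neighbors, which would explain why the top-1 distance changes with targetHits.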

The correct query syntax for exact search is to set approximate:false:

select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));

However, you should use closeness(field, text_embedding) in your first-phase ranking expression.

From https://docs.vespa.ai/en/nearest-neighbor-search.html

The closeness(field, image_embedding) is a rank-feature calculated by the nearestNeighbor query operator. The closeness(field, tensor) rank feature calculates a score in the range [0, 1], where 0 is infinite distance, and 1 is zero distance. This is convenient because Vespa sorts hits by decreasing relevancy score, and one usually want the closest hits to be ranked highest. The first-phase is part of Vespa’s phased ranking support. In this example the closeness feature is re-used and documents are not re-ordered.
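As a rough illustration of the quoted range, the sketch below assumes closeness is computed as 1 / (1 + distance), which matches the stated endpoints (1 at zero distance, approaching 0 at infinite distance). The function names are hypothetical, and the sample distances reuse the two values from the question.

```python
def closeness(distance: float) -> float:
    """Map a non-negative distance to (0, 1]; zero distance gives 1.0."""
    return 1.0 / (1.0 + distance)

def distance_from_closeness(score: float) -> float:
    """Invert the mapping to recover the distance from a closeness score."""
    return 1.0 / score - 1.0

distances = [2.847, 0.5, 3.028079, 0.0]

# Sorting by descending closeness is the same as sorting by ascending distance,
# so the nearest neighbor ends up first.
ranked = sorted(distances, key=closeness, reverse=True)
print(ranked)
```

Because the mapping is strictly decreasing in distance, sorting by descending closeness keeps the nearest neighbor at the top regardless of how many hits are retrieved.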

Jo Kristian Bergum
  • Also, totally unrelated, but note that using `tensor<double>` instead of `tensor<float>` increases the memory footprint by 2x, and also increases computational complexity and memory bandwidth. – Jo Kristian Bergum Oct 05 '22 at 18:07
  • Thanks for the explanation of `closeness`. Regarding `approximate: false` - as you can see, it already is in my example. Doesn't that mean, that the query should perform a full scan and place the most distant items at the very top? Also, if I change the expression to `-distance(field, text_embedding)` (ascending sorting by distance), the results are still indeterminate and depend on `targetHits` – eawer Oct 05 '22 at 20:05
  • The results above are just messed up because you use distance, and as you increase targetHits you put the worst hits first, where worst depends on targetHits. The nearestNeighbor search query operator exposes at least `targetHits` hits to the first-phase ranking function. For example, users can use nn to first retrieve in embedding space, then re-rank by something else. – Jo Kristian Bergum Oct 05 '22 at 20:12
  • Additionally, when using exact search with `approximate:false` you might hit the soft timeout; this is indicated in the response, see https://docs.vespa.ai/en/graceful-degradation.html. The default is 500ms. Depending on your corpus size this might also impact the results, as increasing targetHits makes exact search slower as well. You can increase the timeout with `timeout=10s`. – Jo Kristian Bergum Oct 05 '22 at 20:18
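The timeout advice in the last comment could be applied roughly as follows. This is a sketch, not a verified recipe: it assumes a degraded response carries a `degraded` object with a `timeout` flag inside `root.coverage`, as the graceful-degradation page describes, and the helper `timed_out` and both sample responses are hypothetical.

```python
def timed_out(response: dict) -> bool:
    """Return True if the coverage block reports a timeout-degraded result."""
    coverage = response.get("root", {}).get("coverage", {})
    return bool(coverage.get("degraded", {}).get("timeout", False))

# Query body from the question with an explicit, longer timeout added.
# The query tensor input is omitted here, as it is elided in the question.
body = {
    "yql": "select * from embeddings where ({approximate:false, targetHits:10} "
           "nearestNeighbor(text_embedding, query_embedding))",
    "hits": 10,
    "timeout": "10s",
    "ranking": {"profile": "distance"},
}

# Hypothetical responses: one complete, one cut short by the soft timeout.
complete = {"root": {"coverage": {"coverage": 100, "full": True}}}
degraded = {"root": {"coverage": {"coverage": 40, "full": False,
                                  "degraded": {"timeout": True}}}}

print(timed_out(complete), timed_out(degraded))
```

Checking this flag before comparing results across targetHits values helps rule out partial scans as a source of the differences.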