I have 1M records in my database with the following schema:
schema embeddings {
    document embeddings {
        field id type int {}
        field text_embedding type tensor<double>(d0[960]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 100
                }
            }
        }
    }
    rank-profile closeness {
        num-threads-per-search: 1
        inputs {
            query(query_embedding) tensor<double>(d0[960])
        }
        first-phase {
            expression: closeness(field, text_embedding)
        }
    }
}
My query for finding the nearest neighbors looks like this:
body = {
    'yql': 'select * from embeddings where ({approximate:true, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    "hits": 100,
    'input': {
        'query(query_embedding)': [...],
    },
    "ranking": {
        "profile": "closeness",
        "softtimeout": {
            "enable": false
        }
    }
}
For some reason, for certain vectors the number of results is smaller than targetHits. Changing the timeouts does not help.
Here is the coverage section from the response:
"id": "toplevel",
"relevance": 1.0,
"fields": {
    "totalCount": 39
},
"coverage": {
    "coverage": 100,
    "documents": 1000000,
    "full": true,
    "nodes": 1,
    "results": 1,
    "resultsFull": 1
},
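For reference, this is roughly how the shortfall can be measured on the client side. It is only a minimal sketch: it assumes the response has been parsed into a Python dict in the default Vespa JSON result layout (hits under root.children), and the helper name hit_shortfall and the placeholder hit entries are made up for illustration.

```python
def hit_shortfall(response: dict, target_hits: int) -> int:
    """Return how many hits are missing relative to target_hits (0 if none)."""
    # Assumption: hits are listed under root.children, as in the default
    # Vespa JSON result format.
    root = response.get("root", {})
    returned = len(root.get("children", []))
    return max(0, target_hits - returned)


# Illustrative response shaped like the fragment above. The hit entries are
# placeholders standing in for the 39 returned hits, not real data.
sample = {
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {"totalCount": 39},
        "children": [{"id": f"placeholder-{i}"} for i in range(39)],
    }
}

print(hit_shortfall(sample, target_hits=100))  # prints 61
```

With the response above this reports 61 missing hits, even though coverage is 100% over all 1,000,000 documents.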
Is there any way to receive exactly (or at least no fewer than) targetHits results? Obviously there are enough candidates, since closeness can be calculated for any other vector in the database.