I am copying parts of the Simple Semantic Search sample application at https://github.com/vespa-engine/sample-apps/tree/master/simple-semantic-search to get started with dense vector search.
I have indexed our website, splitting every page into paragraph-sized docs. Some docs consist only of a person's name (a single <div> on the website).
For many queries, these very short docs are ranked at the top although there is no apparent similarity. Querying for "teacher" gives the results below. Why do "Kelly Tracey" and "Luke Hanley" get such a high similarity score?
Doc | Relevance score |
---|---|
Professor Jake Dalton | 0.4810788561826608 |
Kelly Tracey | 0.4618036348887372 |
Prof. Sarah Jacoby | 0.4605411864409834 |
Luke Hanley | 0.45709536853590715 |
Dr. Elizabeth McDougal | 0.4570338357051837 |
Casey Kemp | 0.4508383490617062 |
I removed the bm25 part of the ranking expression for testing, so the rank profile now looks like this:
rank-profile simple_semantic inherits default {
    inputs {
        query(e) tensor<float>(x[384])
    }
    first-phase {
        expression: closeness(field, myEmbedding)
    }
}
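To put the relevance scores in perspective, here is a minimal sketch that converts them back to cosine similarities. It assumes two things that are not shown in the schema excerpt above: that `myEmbedding` uses `distance-metric: angular` (as the sample app does), and that `closeness` is computed as `1 / (1 + distance)`, with the distance being the angle in radians. The helper name `closeness_to_cosine` is made up for illustration.

```python
import math


def closeness_to_cosine(closeness: float) -> float:
    """Invert closeness = 1 / (1 + angle) and return cos(angle).

    Assumption: the tensor field uses the angular distance metric,
    so the distance is the angle (in radians) between the vectors.
    """
    angle = 1.0 / closeness - 1.0
    return math.cos(angle)


# Relevance scores copied from the result table for the query "teacher".
scores = {
    "Professor Jake Dalton": 0.4810788561826608,
    "Kelly Tracey": 0.4618036348887372,
    "Prof. Sarah Jacoby": 0.4605411864409834,
    "Luke Hanley": 0.45709536853590715,
    "Dr. Elizabeth McDougal": 0.4570338357051837,
    "Casey Kemp": 0.4508383490617062,
}

for doc, score in scores.items():
    print(f"{doc}: cosine ~ {closeness_to_cosine(score):.3f}")
```

If those assumptions hold, all six hits land at cosine similarities between roughly 0.35 and 0.47, so the absolute differences between the "Professor" docs and the plain names are small.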
Query:
params = {
    "yql": "select * from kvp_semantic_2 where {targetHits: 100}nearestNeighbor(myEmbedding, e)",
    "input.query(e)": 'embed({"teacher"})',
    "ranking.profile": "simple_semantic",
    "hits": 10
}
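For anyone who wants to reproduce this, the following is a rough sketch of how the parameters above can be sent to the query API with only the standard library. The endpoint `http://localhost:8080/search/` is an assumption (a local default deployment), and the helper names `build_search_url` and `run_query` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Vespa is running locally on the default query port.
ENDPOINT = "http://localhost:8080/search/"


def build_search_url(text: str, hits: int = 10, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL with the same parameters as the query in the question."""
    params = {
        "yql": "select * from kvp_semantic_2 where "
               "{targetHits: 100}nearestNeighbor(myEmbedding, e)",
        "input.query(e)": 'embed({"' + text + '"})',
        "ranking.profile": "simple_semantic",
        "hits": hits,
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def run_query(text: str) -> list:
    """Send the query and return (relevance, fields) pairs for each hit."""
    with urllib.request.urlopen(build_search_url(text)) as resp:
        body = json.load(resp)
    hits = body.get("root", {}).get("children", [])
    return [(hit["relevance"], hit.get("fields", {})) for hit in hits]


# Usage (requires a running instance):
#   for relevance, fields in run_query("teacher"):
#       print(relevance, fields)
```

Running the same helper for "teacher", "biography", and "translator" makes it easy to compare the score ranges side by side.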
The component in services.xml is straight from the sample app:
<component id="bert" class="ai.vespa.embedding.BertBaseEmbedder" bundle="model-integration">
    <config name="embedding.bert-base-embedder">
        <transformerModel path="model/minilm-l6-v2.onnx"/>
        <tokenizerVocab path="model/bert-base-uncased.txt"/>
    </config>
</component>
The same happens with many other queries, like "biography", but not with some, like "translator".