
I am copying parts of the Simple Semantic Search sample application at https://github.com/vespa-engine/sample-apps/tree/master/simple-semantic-search to get started with dense vector search.

I have indexed our website, splitting every page into paragraph-sized docs. Some docs consist only of a person's name (a single <div> on the website).

For many queries these very short docs get ranked on top even though there is no apparent similarity. Querying for "teacher" gives the results below. Why do "Kelly Tracey" and "Luke Hanley" have such high similarity?

    Doc                      Relevance score
    Professor Jake Dalton    0.4810788561826608
    Kelly Tracey             0.4618036348887372
    Prof. Sarah Jacoby       0.4605411864409834
    Luke Hanley              0.45709536853590715
    Dr. Elizabeth McDougal   0.4570338357051837
    Casey Kemp               0.4508383490617062

I removed the bm25 part of the rank profile for testing:

    rank-profile simple_semantic inherits default {
        inputs {
            query(e) tensor<float>(x[384])
        }
        first-phase {
            # rank purely by vector similarity; bm25 removed for testing
            expression: closeness(field, myEmbedding)
        }
    }
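
For reference, decoding those relevance scores: assuming the sample app's angular distance metric, closeness(field, myEmbedding) evaluates to 1 / (1 + angle), where angle is the angle between the query and document vectors in radians. A small sketch of the conversion (the 0.4618 example value is taken from the table above):

    import math

    def closeness_to_cosine(closeness: float) -> float:
        # Invert Vespa's closeness = 1 / (1 + angle) for the angular metric
        angle = 1.0 / closeness - 1.0
        return math.cos(angle)

    # The ~0.46 relevance of "Kelly Tracey" corresponds to a cosine
    # similarity of roughly 0.39: close-ish, but far from a match.
    print(closeness_to_cosine(0.4618))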

Query

        params = {
            "yql": "select * from kvp_semantic_2 where {targetHits: 100}nearestNeighbor(myEmbedding, e)",
            "input.query(e)": 'embed({"teacher"})',
            "ranking.profile": "simple_semantic",
            "hits": 10
        }
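
For completeness, a runnable version of that query, assuming a local deployment on the default query port (printing the raw fields, since the document schema isn't shown here):

    import requests

    params = {
        "yql": "select * from kvp_semantic_2 where {targetHits: 100}nearestNeighbor(myEmbedding, e)",
        "input.query(e)": 'embed({"teacher"})',
        "ranking.profile": "simple_semantic",
        "hits": 10,
    }
    # Vespa's query API also accepts these parameters as a JSON POST body
    response = requests.post("http://localhost:8080/search/", json=params)
    for hit in response.json()["root"]["children"]:
        print(hit["relevance"], hit["fields"])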

The component in services.xml is straight from the sample app:

        <component id="bert" class="ai.vespa.embedding.BertBaseEmbedder" bundle="model-integration">
            <config name="embedding.bert-base-embedder">
                <transformerModel path="model/minilm-l6-v2.onnx"/>
                <tokenizerVocab path="model/bert-base-uncased.txt"/>
            </config>
        </component>
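
Assuming the tokenizerVocab file is a plain-text WordPiece list with one token per line (as bert-base-uncased's vocab.txt is), it is easy to check which of the query and name tokens the tokenizer knows as whole words:

    # Words missing from the vocab are split into subword pieces,
    # which can land anywhere in embedding space.
    with open("model/bert-base-uncased.txt", encoding="utf-8") as f:
        vocab = {line.strip() for line in f}
    for word in ["teacher", "kelly", "tracey", "luke", "hanley"]:
        print(word, word in vocab)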

The same happens with many other queries, like "biography", but not with some, like "translator".

Roope K

1 Answer


The model here is just 90 MB. I don't think you can expect it to contain information about which individual humans are teachers or similar.

When you query for "teacher", the six docs you retrieve are all names of humans, and at least two of them are even professors. I think that's pretty good.
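
You can also reproduce this outside Vespa. A minimal sketch with sentence-transformers, assuming the sample app's minilm-l6-v2.onnx is an export of all-MiniLM-L6-v2 (the paragraph text below is made up for illustration):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    query = model.encode("teacher")
    docs = model.encode([
        "Professor Jake Dalton",
        "Kelly Tracey",
        "A paragraph about course schedules, classrooms and exam dates.",
    ])

    # Bare person names can land surprisingly close to a one-word query:
    # both embeddings are short, underspecified points in the same space.
    print(util.cos_sim(query, docs))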

Jon
  • That one I understand, but I thought it would rank the many paragraphs that contain the word teacher, or something similar, higher. Or could it be that in the model certain first or last names are associated with some teachers that happened to be mentioned in the training corpus? This is obviously more a BERT question than a Vespa question. Sorry I cannot upvote your answer as this account is all new. – Roope K Oct 02 '22 at 14:33
  • Definitely more of a BERT question :-) What it is doing is matching the meaning of the *whole* query against the whole document text used to produce the embedding. I think it's as expected that the concept "teacher" is closer to named persons than to longer sentences which are about more things. Perhaps you want to match by both embeddings and text and use the ranking signals from both. – Jon Oct 05 '22 at 13:56
  • As you can see I am still learning the very basics of practical vector embeddings, and your first answer helped me leap forward quite a bit. Looking at vocab.txt was the key. What I still don't understand is that if a model is small and intended for general usage, why would they include first and last names at all, other than some really famous people? Could you just replace proper nouns with words like "person", and get more accurate results out-of-domain? – Roope K Oct 06 '22 at 16:17
  • Yes, that might be better, but then you'd need to somehow decide who counts as a famous person etc. As far as I know no such considerations have gone into this, and which names near the threshold of sufficient notoriety get included is mostly up to chance. – Jon Oct 06 '22 at 21:07
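
A sketch of the idea from the last two comments, masking person names before embedding. spaCy's NER and the replacement word "person" are arbitrary choices here, and whether this actually helps out-of-domain is untested:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def mask_person_names(text: str) -> str:
        out = text
        # Replace right-to-left so earlier character offsets stay valid
        for ent in reversed(nlp(text).ents):
            if ent.label_ == "PERSON":
                out = out[:ent.start_char] + "person" + out[ent.end_char:]
        return out

    print(mask_person_names("Kelly Tracey teaches biology."))  # -> "person teaches biology."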