How processed tokens get stored in base index in Vespa?

Question

While working with search definition which looks like

search music{
    document music{
        field title type string {
            indexing: summary | attribute | index
        }
    }
}

if I use my custom logic of tokenizing string by developing document processor (I save processed tokens in context of Processing), how to store tokens in the base index? and how they are mapped back to the original content of the field, while recall for a particular query? Do we solve it by ProcessingEndPoint? If yes, how?

score 4 · Accepted Answer · answered Mar 06 '18 at 13:58

First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.

Unnecessary details:

You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing are stored as annotations over the text which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization at the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html

However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.

The reason for doing this is usually to provide linguistics for other languages than english. If you do this, please consider providing your linguistics code back to Vespa.

How processed tokens get stored in base index in Vespa?

1 Answers1