
How do I provide an OpenNLP model for tokenization in Vespa? The documentation mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If so, can I simply use the set_language index expression as described in the docs? I did not find any relevant information on how to implement this in https://docs.vespa.ai/en/linguistics.html; could you please help me out with this?

I need this for CJK support.

Snaps

1 Answer


Yes, the default tokenizer is OpenNLP, and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to use set_language (and the language=... parameter in queries), since language detection is unreliable on short text.
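
As a minimal sketch (the schema and field names here are just placeholders), setting the language from a document field could look like this:

```
schema msg {
    document msg {
        # Declare the language field before the text fields,
        # so set_language takes effect before they are tokenized
        field language type string {
            indexing: set_language
        }
        field msg_content type string {
            indexing: index | summary
        }
    }
}
```

In queries you would then pass the language explicitly, e.g. language=ja, instead of relying on detection.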

However, OpenNLP tokenization (as opposed to language detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.

To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-grams instead of tokenization; see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
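
A sketch of n-gram matching on a text field (assuming the msg_content field above; gram-size 2 is a common choice for CJK):

```
field msg_content type string {
    indexing: index | summary
    match {
        # Index and match 2-character grams instead of tokens
        gram
        gram-size: 2
    }
}
```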

n-gram is often a good choice with Vespa because it doesn't suffer from the recall problems of CJK tokenization, and by using a ranking model which incorporates proximity (such as nativeRank) you'll still get good relevance.
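
For example, a rank profile along these lines (again just a sketch, assuming the msg_content field above) uses nativeRank's proximity features:

```
rank-profile default {
    first-phase {
        # nativeRank combines term match and proximity signals for the field
        expression: nativeRank(msg_content)
    }
}
```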

Jon
  • **yql=select * from sources * where msg_content contains "ありがとう";&language="jpn"**. Is this the correct format to use the _language_ query parameter? – Snaps May 21 '21 at 05:10
  • Yes. See https://docs.vespa.ai/en/reference/query-api-reference.html#model.language – Jon May 21 '21 at 07:56