I am building a full-text search index with SQLite and don't understand what is going on internally when I scan documents that contain more than one language.

For example, I describe a programming topic I'm learning in Russian and add code blocks to the description, with programming-language statements and comments that are obviously in English.

Consider the example document.txt:

Вывод хранимых данных производится следующей командой

import storage
def main():  # Comments just to represent an example
    print(storage.data)

As you can see, document.txt mixes two languages.

I use the snowball tokenizer (it reuses the standard Snowball library) to index the completed documents, explicitly specifying CREATE VIRTUAL TABLE documents USING FTS5(text, tokenize='snowball russian'); and it handles them with no issues. So my question is: why does this work? The documents contain English words, and afterwards the index contains English stems along with Russian stems; I can search for команда or commenting successfully. Is this how things are supposed to work?
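One way to see what actually ends up in the index is to query it through an fts5vocab table, which lists the stored terms. The sketch below is only an approximation of the setup above: it assumes the Python sqlite3 module is linked against an SQLite build with FTS5 enabled, and it swaps in the built-in porter tokenizer (an English stemmer layered on unicode61) because the snowball tokenizer is an external extension. The names build_index, indexed_terms, count_matches and documents_vocab are made up for illustration. It shows that a stemmer for one language still tokenizes and indexes words written in another script; it does not show how the snowball extension itself treats Latin-script words.

```python
import sqlite3

# The document from the question: Russian prose plus English code and comments.
DOCUMENT = """Вывод хранимых данных производится следующей командой

import storage
def main():  # Comments just to represent an example
    print(storage.data)
"""


def build_index(text):
    """Create an in-memory FTS5 table and index a single document."""
    conn = sqlite3.connect(":memory:")
    # Assumption: the built-in 'porter' tokenizer stands in for the
    # external snowball tokenizer used in the question.
    conn.execute(
        "CREATE VIRTUAL TABLE documents USING fts5(text, tokenize='porter')"
    )
    # fts5vocab exposes the terms that are actually stored in the index.
    conn.execute(
        "CREATE VIRTUAL TABLE documents_vocab "
        "USING fts5vocab('documents', 'row')"
    )
    conn.execute("INSERT INTO documents(text) VALUES (?)", (text,))
    return conn


def indexed_terms(conn):
    """Return every term stored in the full-text index, sorted."""
    rows = conn.execute("SELECT term FROM documents_vocab ORDER BY term")
    return [row[0] for row in rows]


def count_matches(conn, query):
    """Count the documents that match a full-text query."""
    row = conn.execute(
        "SELECT count(*) FROM documents WHERE documents MATCH ?", (query,)
    ).fetchone()
    return row[0]


conn = build_index(DOCUMENT)
terms = indexed_terms(conn)
print(terms)                               # stored terms, both scripts
print(count_matches(conn, "commenting"))   # English query, stemmed by porter
print(count_matches(conn, "команда"))      # Russian query, not stemmed here
```

With this stand-in the English words are stemmed and the Russian words are stored as plain lowercased tokens, so the Russian query only matches exact word forms. Running the same vocabulary query against a table built with the snowball tokenizer should show which stems it really stores for each language.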

kvdm.dev
  • *snowball* isn't one of the standard FTS5 tokenizers. Maybe include a link to its implementation? – Shawn May 27 '21 at 20:51
  • Do you think it depends entirely on the algorithm's implementation? My view is that this is a common issue for all implementations and that its solution should be common too, but that's only my assumption and I want to figure it out. I've added a link to the implementation. – kvdm.dev May 28 '21 at 09:39

0 Answers