Searching within a text using ngram for the minimum chars of the search pattern and above

Question

I have a text index on my Elasticsearch server. I have configured an ngram tokenizer like this:

"analysis": {
  "analyzer": {
    "ngram_analyzer": {
      "type": "custom",
      "tokenizer": "ngram_tokenizer"
    }
  },
  "tokenizer": {
    "ngram_tokenizer": {
      "type": "ngram",
      "min_gram": "3",
      "max_gram": "7"
    }
  }
},

Let's say my data is

"Hello beautiful world ell"

When I run a match query for "Hell", I want it to find only the first word (Hello) and not the word "ell" as well. Basically, I don't want the search pattern to be broken into smaller pieces just to find a match; it should be matched against my data as is (with 4 characters, not fewer).

Thank you

Hello will be tokenized as "Hel, ell, llo, Hell, ello, Hello" and ell as "ell". When you search, you will still get only one result, and that is your entire string: "Hello beautiful world ell". Let's say you have a list of sentences where one is "Hello beautiful world" and another is "beautiful world ell". If you search for "ell", you will get both, since that is how your tokenizer indexed them. – mirzak, Dec 06 '16 at 12:45
I agree with you, but I was searching for Hell, and I would like to match the words Hell and Hello and not ell (since I didn't search for it: it has fewer letters and is missing the H). – IB., Dec 06 '16 at 12:53
What I don't understand is why the search is breaking my word into Hel, ell, Hell instead of only searching for the term Hell. – IB., Dec 06 '16 at 12:54
Thank you, but I do need it from 3 and above. The problem is that it takes my 4-letter word and also searches for it in chunks of 3 letters, which I don't want. If I search for a 4-letter word, it should only match words of 4 letters or more in my data. – IB., Dec 06 '16 at 13:04
OK, now I get what you mean. Is there anything else besides the settings you showed? This should work. – mirzak, Dec 06 '16 at 13:15
It doesn't work. I still get ell as a match, for example in the phrase "beautiful world ell". – IB., Dec 06 '16 at 13:19
My guess would be that something is tokenizing your search query. – mirzak, Dec 06 '16 at 13:32
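The behaviour discussed in these comments can be reproduced with a rough simulation. This is only a sketch, not Elasticsearch itself: the helper names `ngrams` and `ngram_analyze` are made up for illustration, and it assumes grams are built per whitespace-separated word, as the comments describe. It shows what happens when the same ngram analyzer is applied to the query as well as to the indexed text.

```python
def ngrams(word, min_gram=3, max_gram=7):
    """Return every substring of word whose length is in [min_gram, max_gram]."""
    return [
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(word)
    ]


def ngram_analyze(text):
    # Assumption: grams never span whitespace (one word at a time).
    return [gram for word in text.split() for gram in ngrams(word)]


# The query goes through the same ngram analyzer as the document.
query_terms = ngram_analyze("Hell")
print(query_terms)

# For each indexed word, list the query terms that hit one of its grams.
hits_per_word = {}
for word in "Hello beautiful world ell".split():
    word_terms = set(ngrams(word))
    hits_per_word[word] = [t for t in query_terms if t in word_terms]
    print(word, "->", hits_per_word[word])
```

Because the query "Hell" is itself split into "Hel", "Hell" and "ell", the 3-character gram "ell" is enough to match the standalone word "ell".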

score 1 · Accepted Answer · answered Dec 06 '16 at 14:05

The solution is to use a different tokenizer in your search analyzer.

For example, you could define it like this:

"some_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ]
}

The important part is that your search analyzer does not use the nGram tokenizer.
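As a sketch of how the two analyzers might be wired together, the snippet below builds a create-index request body as a Python dict and prints it as JSON. Several things here are assumptions rather than part of the answer: the field name `content` is hypothetical, the typeless `mappings` layout and the `index.max_ngram_diff` setting apply to newer Elasticsearch versions, the `token_chars` entry is added so that grams stay inside single words, and the `lowercase` filter is added to the index analyzer so that a lowercased query term can match the indexed grams.

```python
import json

index_body = {
    "settings": {
        "index": {
            # Assumption: newer versions cap max_gram - min_gram (default 1),
            # so the 3..7 range from the question needs this raised to 4.
            "max_ngram_diff": 4,
            "analysis": {
                "analyzer": {
                    # Used at index time: splits words into 3..7 char grams.
                    "ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "ngram_tokenizer",
                        "filter": ["lowercase"],  # assumption, see lead-in
                    },
                    # Used at search time: keeps each query word whole.
                    "some_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase"],
                    },
                },
                "tokenizer": {
                    "ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 7,
                        # Assumption: build grams from letters and digits only.
                        "token_chars": ["letter", "digit"],
                    }
                },
            },
        }
    },
    "mappings": {
        "properties": {
            # "content" is a hypothetical field name.
            "content": {
                "type": "text",
                "analyzer": "ngram_analyzer",
                "search_analyzer": "some_analyzer",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The key design choice is that `analyzer` and `search_analyzer` differ on the field: the ngram work happens once at index time, and the query term is compared against those grams unchanged.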

mirzak


Thank you, but I think the whitespace tokenizer will not allow me to search for partial words. If I have the text "Hello beautiful world ell" and I search for "Hell", it will not find it in the first word as I am expecting, no? – IB. Dec 06 '16 at 14:52
I have just tried it like this. I indexed "Hello beautiful world ell" and searched for "Hell": it had one hit, in "Hello". This is because I index with the nGram tokenizer (3 - 30). Hello will be tokenized as "Hel, ell, llo, Hell, ello, Hello", and "ell" is just "ell". The search analyzer's tokenizer is "whitespace", which means it will split the search string on whitespace. In my case it will not split anything, since the query is just "Hell". It was able to find it because I have "Hell" indexed as one of the terms. More on terms: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – mirzak Dec 07 '16 at 07:03
Thank you very much, that is exactly what I did and it solved this problem! New problem now: I am using highlights, and because I am searching for Hell, it will not highlight the word Hello for some reason... – IB. Dec 07 '16 at 07:10
That will never work. Highlighting will only highlight the hit. So in your case it would only be like "Hello beautiful world ell". – mirzak Dec 07 '16 at 07:15
Agreed, that is how I wanted it to work, but for some reason it doesn't highlight it. – IB. Dec 07 '16 at 07:28
You should create another question for it and see if somebody has a solution, and close this one. – mirzak Dec 07 '16 at 09:32
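The setup described in the comments above (nGram 3 - 30 at index time, whitespace plus lowercase at search time) can be approximated with a small simulation. This is a sketch under stated assumptions, not Elasticsearch code: the helper names are hypothetical, grams are built per whitespace-separated word, and the indexed grams are lowercased so that they line up with the lowercased query term.

```python
def ngrams(word, min_gram=3, max_gram=30):
    """Return every substring of word whose length is in [min_gram, max_gram]."""
    return [
        word[i:i + n]
        for i in range(len(word))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(word)
    ]


def index_terms(word):
    # Index side: lowercased ngrams of a single word (lowercasing is assumed).
    return set(ngrams(word.lower()))


def search_terms(query):
    # Search side: whitespace tokenizer plus lowercase filter, no ngrams.
    return query.lower().split()


def matching_words(text, query):
    """Return the words in text that contain any whole query term as a gram."""
    terms = search_terms(query)
    return [w for w in text.split() if any(t in index_terms(w) for t in terms)]


text = "Hello beautiful world ell"
print(matching_words(text, "Hell"))
print(matching_words(text, "ell"))
```

Here "Hell" stays a single 4-character term, so it only matches "Hello", while a 3-character query such as "ell" still matches both "Hello" and "ell".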
