opensearchserver tokenizer for permutation of all words in query

Question

I need to configure OpenSearchServer to analyse the query so that a document is returned when any permutation of the words in the query matches it.

For example, an indexed field contains the phrase "knee pain". If my query is "how to remove pain in human knee", I want it to return the document that has "knee pain" in that indexed field.

Hence I need the query string broken into tokens such as "remove", "pain", "human", "knee", "remove pain", "remove knee", "remove human", "pain knee", "human knee", "knee pain", "human pain", etc.

That way it would match "knee pain". Is there any tokenizer or filter that can help me achieve this?

score 1 · Answer 1 · answered May 03 '16 at 12:44


Select your index, click on the Schema tab, and then click the Analyzers tab.

I normally edit the TextAnalyzer and add additional filters to it. I usually start with the lowercase and stop filters, which make searches case-insensitive and remove stop words like "a", "an", "the".

Then, the Shingle filter will give you the n-grams needed for phrase matches. A shingle size of 3 or 4 words usually works. Shingling creates overlapping sequences of adjacent words from the analyzed text. "The brown fox jumps high" with a shingle size of 3 would produce n-grams of 1, 2, and 3 words. For example, 1-word: "the", "brown", "fox", "jumps", "high"; 2-word: "the brown", "brown fox", "fox jumps", "jumps high"; 3-word: "the brown fox", "brown fox jumps", "fox jumps high".
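To make the shingling idea concrete, here is a minimal sketch in plain Python. It is not OpenSearchServer's or Lucene's implementation; the function name `shingles` and the parameter `max_size` are made up for illustration, and it assumes simple whitespace tokenization with lowercasing and no stop-word removal.

```python
def shingles(text, max_size=3):
    """Return word n-grams of 1..max_size adjacent words, in original order.

    Illustrative only: whitespace tokenization plus lowercasing,
    no stop-word removal.
    """
    words = text.lower().split()
    result = []
    for size in range(1, max_size + 1):
        # Slide a window of `size` adjacent words across the text.
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


tokens = shingles("The brown fox jumps high", max_size=3)
print(tokens)

# Shingles only cover adjacent words in their original order.
query_tokens = shingles("how to remove pain in human knee", max_size=2)
print("human knee" in query_tokens)
print("knee pain" in query_tokens)
```

Under these assumptions, the second query produces "human knee" but never "knee pain", because those two words are neither adjacent nor in that order in the query.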


Fix It Scotty


The shingle filter only works for combinations of consecutive words, and only in one direction. For example, in your image "knee pain" is not present. Also, can you tell me what the number associated with each token signifies (the number inside the square brackets)? – Ankit Agarwal May 03 '16 at 16:25
The number in the token is the start and end character position of that term in the original analyzed string. For example, in remove [7,13 - 1], the word starts at character 7 in the string and ends at character position 13. I'm not sure what the " - 1" is. It is true that shingling will not create every permutation of the words in a string, only sequences of adjacent words. But non-adjacent word matching is handled by the Lucene scoring. The shingle filter gives a higher score for adjacent-word phrase matches because the phrase will exactly match the n-gram token. – Fix It Scotty May 03 '16 at 17:45
So there is no way to get tokens in all possible permutations? – Ankit Agarwal May 04 '16 at 05:32
