Solr 6 index content in intervals

Question

I am using Solr 6, and my requirement is to find documents in which a run of 5 consecutive words (separated by spaces) is duplicated.

To achieve this, I am planning to index the content in overlapping windows of 5 words. For example, if my content is "The quick brown fox jumps over the lazy dog", it should be indexed as "The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", and so on.

To configure the tokenizer, I referred to this wiki but didn't find any provided tokenizer that solves this problem. So I am looking for a way to create a new tokenizer class, or any other way to solve it using the provided tokenizers. Any help would be appreciated.

score 1 · Answer 1 · answered Jul 10 '17 at 08:11

You can use the Shingle filter for exactly this purpose. It is a filter, not a tokenizer, but it does what you need.
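As a rough sketch of how this might look in the schema, assuming whitespace tokenization is enough for space-separated words: the field type name `text_shingle5` is hypothetical, and the attribute values are assumptions chosen so that only 5-word shingles are emitted.

```xml
<!-- Hypothetical field type; the name "text_shingle5" is illustrative only -->
<fieldType name="text_shingle5" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, since the words are separated by spaces -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Combine adjacent tokens into shingles of exactly 5 words;
         outputUnigrams="false" suppresses the single-word tokens -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="5"
            maxShingleSize="5"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With a configuration along these lines, "The quick brown fox jumps over the lazy dog" should be indexed as overlapping 5-word terms such as "The quick brown fox jumps" and "quick brown fox jumps over". The Analysis screen in the Solr admin UI is a convenient place to check the actual output for your field.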

Persimmonium