
Given the input: "ABC regional private coastal area"

The tokenization I want (I am currently using ShingleFilterFactory): "ABC regional private coastal area", "ABC regional private coastal", "ABC regional private", "ABC regional", "ABC".
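To make the desired output concrete, here is a minimal sketch in plain Python of "word-level prefixes, longest first". It only illustrates the expected tokens; it is not a Solr filter, and the function name `word_prefixes` is made up for this example.

```python
def word_prefixes(text):
    """Return every word-level prefix of `text`, longest first."""
    words = text.split()  # split on whitespace, like a whitespace tokenizer
    # Take the first n words for n = len(words) down to 1.
    return [" ".join(words[:n]) for n in range(len(words), 0, -1)]


if __name__ == "__main__":
    for token in word_prefixes("ABC regional private coastal area"):
        print(token)
```

Every token starts at the first word, so nothing like "regional" or "coastal area" is ever produced.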

What I actually get: "ABC regional private coastal area", "ABC regional private coastal", "ABC regional", "ABC", "regional", etc.

Sometimes it also creates tokens such as "regional _ coastal", "regional _ coastal area", "_ coastal".

Is there any filter or tokenizer that will help me achieve the result I want?

Already tried: EdgeNGram (character-level token split), NGram (character-level token split), ShingleFilterFactory (word-level token split).

Results: shingle comes close, but it also creates tokens that do not start at the first word. For example, "hello world sample" is tokenized into "hello world", "world", "sample", which gives me unnecessary matches for both "sample" and "world" that I don't need.
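The difference between ordinary shingles and what I need can be sketched in plain Python. This is only an emulation of word n-grams, not Solr's ShingleFilterFactory; the names `shingles_with_start`, `min_size` and `max_size` are hypothetical and only loosely mirror the filter's `minShingleSize` / `maxShingleSize` settings.

```python
def shingles_with_start(text, min_size=1, max_size=3):
    """Return (start_index, shingle) pairs for all word n-grams of text."""
    words = text.split()
    pairs = []
    for start in range(len(words)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                pairs.append((start, " ".join(words[start:start + size])))
    return pairs


def anchored_only(pairs):
    """Keep only the shingles that begin at the first word (start == 0)."""
    return [token for start, token in pairs if start == 0]


if __name__ == "__main__":
    pairs = shingles_with_start("hello world sample")
    print([token for _, token in pairs])  # includes "world", "sample", ...
    print(anchored_only(pairs))           # only tokens starting at "hello"
```

Filtering on the start position rather than on the first word's text keeps the result correct even when the first word appears again later in the input.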

Thanks in advance.

Use these links to look at the query and results: [Query Performed](https://i.stack.imgur.com/TUHHn.png), Shingle, EdgeNGram

AMAN
  • In your first example you also included `regional` by itself, which is the same as what you say you don't want in your last example. Do you want the shingles only generated from the start? i.e. `ABC`, `ABC regional`, etc., and nothing starting with any other word? Only prefixes of the original string? – MatsLindh Nov 17 '22 at 13:11
  • I solved it with a custom tokenizer and filter. Now it's working the way I wanted my tokenizer and filter to work. Thanks :) – AMAN Nov 25 '22 at 09:53
