Consider the following case: with the word delimiter graph token filter and `catenate_words` set to true, I get the following tokens: super-duper-xl → [ superduperxl, super, duper, xl ]

However, the desired tokens are all sequential (contiguous) combinations of the parts around the delimiter: super-duper-xl → [ superduperxl, superduper, duperxl, super, duper, xl ]
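To make the target output concrete, here is a minimal plain-Python sketch of the token set I am after. It is only an illustration of the desired result, not Elasticsearch or Solr code; the function name `contiguous_combinations` and the delimiter regex are hypothetical choices for this example.

```python
import re


def contiguous_combinations(text, delimiter_pattern=r"[^A-Za-z0-9]+"):
    """Return every contiguous run of word parts, joined without a separator.

    The delimiter pattern is an assumption for this sketch: any run of
    non-alphanumeric characters is treated as a delimiter.
    """
    # Split on delimiters and drop empty pieces (e.g. from leading delimiters).
    parts = [p for p in re.split(delimiter_pattern, text) if p]
    tokens = []
    # Longest runs first, matching the order used in the example above.
    for size in range(len(parts), 0, -1):
        for start in range(len(parts) - size + 1):
            tokens.append("".join(parts[start:start + size]))
    return tokens


print(contiguous_combinations("super-duper-xl"))
# ['superduperxl', 'superduper', 'duperxl', 'super', 'duper', 'xl']
```

For n parts this produces n(n+1)/2 tokens, so the index grows quickly for inputs with many delimiters.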

Can anyone suggest the best way to do this? Is there any config available in the word delimiter graph filter (WDGF) itself that can be utilised, or is writing a custom analyzer the only option we have?

Yatin
  • Not sure if it's the best option, but you can set the WDF to `preserveOriginal="1"` and put a `PatternCaptureGroupFilterFactory` just after it in the analysis chain, so that it catches the original token and emits the sub-tokens that the WDF missed, based on its ALPHA delimiter config. Note that it takes some effort to tune the filter to match the WDF settings and make it work properly; on my side I ended up with a huge regex for the word parts, and added another filter to emit sub-tokens for the number parts. If you are comfortable with regex, give it a try. – EricLavault Jan 12 '22 at 19:11
  • @EricLavault thanks. Here is another way, slightly simpler than the method you suggested: use a shingle filter to generate the combinations, then pair it with the word delimiter graph token filter and its catenate settings. Depending on the use case, we can control the possible shingle combinations as well as the possible concatenations. – Yatin Jan 13 '22 at 21:57

0 Answers