I'm trying to use Lucene (5.5.0) for some string tokenization (no indexing). I need to:
- Completely remove words containing numbers, so for example a word like log4j should be removed from the string entirely
- Split my string into one-word terms and also into 2-gram terms. For example, "tie a yellow ribbon" should be tokenized into the terms: "tie", "yellow", "ribbon", "yellow ribbon". Note that "tie yellow" is not a valid term, since the two words are separated by a stop word in the original text
Is this possible to do with Lucene? If so, how?
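To make the expected output concrete, here is the transformation I'm after, written in plain Java with no Lucene involved (the stop-word set here is just an example, not a real stop-word list):

```java
import java.util.*;

class ExpectedTerms {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Tokenize on whitespace, drop stop words and digit-containing words,
    // then emit unigrams plus 2-grams of words that were adjacent in the
    // original text (a 2-gram must not span a removed stop word).
    static List<String> terms(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> kept = new ArrayList<>();   // surviving words
        List<Integer> pos = new ArrayList<>();   // their original positions
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (STOP.contains(w) || w.matches(".*\\d.*")) continue;
            kept.add(w);
            pos.add(i);
        }
        List<String> out = new ArrayList<>(kept); // unigrams first
        for (int i = 1; i < kept.size(); i++) {
            // adjacent in the original text => valid 2-gram
            if (pos.get(i) == pos.get(i - 1) + 1) {
                out.add(kept.get(i - 1) + " " + kept.get(i));
            }
        }
        return out;
    }
}
```

So `terms("tie a yellow ribbon")` should give `[tie, yellow, ribbon, yellow ribbon]`, and a word like `log4j` should never appear in the output at all.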
What I've done so far:
- Regarding removing words containing numbers: I came across WordDelimiterFilter, which is not what I need, since its documentation shows it splitting the word SD500 into "SD" and "500", while I want to remove the token entirely. I also found NumericPayloadTokenFilter, which looks promising (judging by the name), but I'm having trouble understanding how to work with it
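The rule I'm after is simply "drop any token that contains at least one digit". As a predicate it would look like this in plain Java (my guess is that in Lucene this would belong in the `accept()` method of a custom filter extending `FilteringTokenFilter`, but I'm not sure that's the right approach):

```java
class DigitCheck {
    // Returns true when the token contains at least one digit
    // and should therefore be dropped entirely.
    static boolean containsDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```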
- Regarding the 2-grams and 1-grams: I've found several examples of how to do that here, here, and in the NGramTokenizer documentation, but they all seem to operate on characters rather than on words, which is what I need
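I also tried sketching an analyzer chain with ShingleFilter, which (unlike NGramTokenizer) seems to build n-grams over tokens rather than characters. This compiles against lucene-analyzers-common 5.5.0 but I haven't verified the output; in particular, I believe ShingleFilter inserts a `_` filler token where StopFilter removed a word, so shingles containing the filler (e.g. "tie _") would still have to be dropped afterwards:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class UnigramBigramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Remove stop words; this leaves a position gap behind.
        stream = new StopFilter(stream, StandardAnalyzer.STOP_WORDS_SET);
        // Emit 2-word shingles alongside the original unigrams.
        ShingleFilter shingles = new ShingleFilter(stream, 2, 2);
        shingles.setOutputUnigrams(true);
        return new TokenStreamComponents(source, shingles);
    }
}
```

Is this the right direction, and is there a clean way to suppress the filler-token shingles and to plug in the digit-word removal?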
Thanks in advance