
I'm trying to use Lucene (5.5.0) for some string tokenization (no indexing). I need to:

  1. Completely remove words containing numbers; for example, a word like log4j should be removed from the string entirely.
  2. Split the string into one-word terms and also into 2-gram terms. For example, "tie a yellow ribbon" should be tokenized into the terms "tie", "yellow", "ribbon", "yellow ribbon". Note that "tie yellow" is not a term, since the original phrase has a stop word between those two words.

Are these possible to do with Lucene? If so how?

What I've done so far:

  • Regarding removing words containing numbers: I came across WordDelimiterFilter, which is not what I need, since its documentation shows it splitting the word SD500 into "SD" and "500", while I want to remove such words entirely. I also found NumericPayloadTokenFilter, which looks promising (judging by the name), but I'm having trouble understanding how to work with it.
  • Regarding the 2-grams and 1-grams: I've found several examples of how to do that here, here and in the NGramTokenizer documentation, but they all seem to work on characters rather than words, which is what I need.

Thanks in advance

Gideon

1 Answer

On requirement 1: I'm not aware of anything that does this, out of the box. NumericPayloadTokenFilter is definitely not what you need. You will probably need to create your own token filter to do this.
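A custom filter for this is short. Here is a minimal sketch of one way to write it against Lucene 5.5's `FilteringTokenFilter` base class, which calls `accept()` once per token and drops tokens for which it returns false. The class name `NumberBearingWordFilter` and the helper `containsDigit` are my own illustrative names, not anything shipped with Lucene:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

/** Drops any token that contains a digit, e.g. "log4j" or "SD500". */
public final class NumberBearingWordFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public NumberBearingWordFilter(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() throws IOException {
        // Keep the token only if it contains no digits.
        return !containsDigit(termAtt.buffer(), termAtt.length());
    }

    static boolean containsDigit(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (Character.isDigit(buf[i])) {
                return true;
            }
        }
        return false;
    }
}
```

You would chain this after your tokenizer like any other `TokenFilter`, e.g. `new NumberBearingWordFilter(new LowerCaseFilter(tokenizer))`.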

On requirement 2: NGrams, in Lucene parlance, are generally based on characters. What you want is ShingleFilter, which combines tokens. It will create shingles at stop words, like: tie _ and _ yellow, where _ is a generic filler token.
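As a rough sketch of the wiring, assuming Lucene 5.5 and the default English stop set from `StopAnalyzer` (the surrounding `main` scaffolding is just for demonstration): run a `StopFilter` before the `ShingleFilter`, ask the shingle filter for size-2 shingles, and keep unigram output on. The stream will also emit filler shingles such as "tie _" and "_ yellow", which you would discard afterwards (e.g. with a small filter like the one from requirement 1 that rejects terms containing the filler token):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleDemo {
    public static void main(String[] args) throws IOException {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("tie a yellow ribbon"));

        TokenStream ts = new LowerCaseFilter(tokenizer);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        ShingleFilter shingles = new ShingleFilter(ts, 2, 2); // word bigrams only
        shingles.setOutputUnigrams(true);                     // also emit single words

        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term.toString());
        }
        shingles.end();
        shingles.close();
    }
}
```

The output includes "tie", "yellow", "ribbon" and "yellow ribbon", plus the filler shingles mentioned above; filtering out anything containing "_" leaves exactly the terms the question asks for.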

femtoRgon
  • Writing the filter was fairly easy once I looked at some example filters. Thanks for your answer – Gideon Mar 01 '16 at 12:30