
I need a Lucene Tokenizer that can do the following. Given the string "wines bottle caps", the following queries should succeed:

  • wine
  • bott
  • cap
  • ottl
  • aps
  • wine bottl

Here is what I have so far. How might I modify it to make this work? No query of fewer than three characters should match.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class PorterAnalyzer extends Analyzer {

  private final Version version;

  public PorterAnalyzer(Version version) {
    this.version = version;
  }

  @Override
  @SuppressWarnings("resource")
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Tokenize, lowercase, remove stop words, then Porter-stem each token.
    final StandardTokenizer src = new StandardTokenizer(reader);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    tok = new PorterStemFilter(tok);
    return new TokenStreamComponents(src, tok);
  }

}
Katedral Pillon

1 Answer

I think you are looking for NGramTokenFilter.

Try, for example:

tok = new NGramTokenFilter(tok, 2, 5);
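
For instance, appended to the analyzer from the question, the chain could look roughly like this. This is only a sketch: it reuses the Lucene 4.x-style constructors from the question's code, the class name PorterNGramAnalyzer is purely illustrative, and minGram is set to 3 so that fragments shorter than three characters are never indexed (matching the three-character minimum in the question):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PorterNGramAnalyzer extends Analyzer {

  @Override
  @SuppressWarnings("resource")
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(reader);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    tok = new PorterStemFilter(tok);
    // At index time, expand each stemmed token into every substring of
    // 3 to 5 characters, so a partial query like "ottl" can match the
    // stemmed token "bottl".
    tok = new NGramTokenFilter(tok, 3, 5);
    return new TokenStreamComponents(src, tok);
  }

}

Note that the grams are built from the stemmed forms (e.g. "bottl", "cap") because the filter sits after PorterStemFilter, and that longer partial queries need a larger maxGram, which is what the comments below discuss.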
femtoRgon
timo
  • Could you explain a bit what minGram and maxGram do? Basically, how do I know what these values should be for diverse queries? – Katedral Pillon Jun 01 '15 at 17:37
  • When you search for "ottl", you get no results because your index only contains "bottle". But you can add an NGramTokenFilter to your indexing filter chain. This would add all possible parts of each word with at least minGram (e.g. 2) and at most maxGram (e.g. 5) letters. So, for example, with 2 letters: bo, ot, tt ...; with 3 letters: bot, ott, ottl, ...; with 4 ...; and with 5 letters: bottl and ottle (a short demonstration of this expansion follows these comments). – timo Jun 01 '15 at 18:21
  • Then it sounds like maxGram is a problem. I would have to compute maxGram on the fly for each document I add to the index. For example, if a title is "programming language java", then I need maxGram recomputed as 11 so that `programmi` would get a hit. Having to recompute maxGram each time is not scalable. I need to be able to instantiate a writer as `writer = new IndexWriter(index, config)` and use it to add a number of documents. That does not seem possible if I have to specify maxGram each time. **Do you know any other approach?** – Katedral Pillon Jun 01 '15 at 18:49
  • You don't need to recompute maxGram. If you want "all" parts of a word, you can use a very high value for maxGram, e.g. 50, so it would work for all words with up to 51 letters. – timo Jun 01 '15 at 19:41
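
To make the expansion described in the comments concrete, here is a minimal standalone sketch (the class name NGramDemo is made up for illustration; it reuses the StandardTokenizer and NGramTokenFilter constructors already shown above) that prints the grams emitted for the single word "bottle":

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
  public static void main(String[] args) throws Exception {
    // Tokenize the single word "bottle", then expand it into 2- to 5-character grams.
    StandardTokenizer source = new StandardTokenizer(new StringReader("bottle"));
    TokenStream grams = new NGramTokenFilter(source, 2, 5);
    CharTermAttribute term = grams.addAttribute(CharTermAttribute.class);

    grams.reset();
    while (grams.incrementToken()) {
      // Prints every substring of 2 to 5 characters: bo, ot, tt, ..., bottl, ottle
      // (the exact order of emission depends on the Lucene version).
      System.out.println(term.toString());
    }
    grams.end();
    grams.close();
  }
}

With a large maxGram (e.g. the 50 suggested above), the same filter also emits the longer substrings, which is why maxGram does not have to be recomputed per document.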