
I need a Lucene Tokenizer that can do the following. Given the string "wines bottle caps", the following queries should succeed:

  • wine
  • bott
  • cap
  • ottl
  • aps
  • wine bottl

Here is what I have so far. How might I modify it to make this work? No query of fewer than three characters should match.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class PorterAnalyzer extends Analyzer {

  private final Version version;

  public PorterAnalyzer(Version version) {
    this.version = version;
  }

  @Override
  @SuppressWarnings("resource")
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Tokenize, lowercase, remove stop words, then Porter-stem each token.
    final StandardTokenizer src = new StandardTokenizer(reader);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    tok = new PorterStemFilter(tok);
    return new TokenStreamComponents(src, tok);
  }

}
Katedral Pillon

1 Answer

I think you are looking for NGramTokenFilter.

Try, for example:

tok = new NGramTokenFilter(tok, 2, 5);
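
For instance, appended to the analyzer from the question, the chain could look roughly like this. This is only a sketch: it reuses the Lucene 4.x-style constructors from the question's code, the class name PorterNGramAnalyzer is purely illustrative, and minGram is set to 3 so that fragments shorter than three characters are never indexed (matching the three-character minimum in the question):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class PorterNGramAnalyzer extends Analyzer {

  @Override
  @SuppressWarnings("resource")
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(reader);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    tok = new PorterStemFilter(tok);
    // At index time, expand each stemmed token into every substring of
    // 3 to 5 characters, so a partial query like "ottl" can match the
    // stemmed token "bottl".
    tok = new NGramTokenFilter(tok, 3, 5);
    return new TokenStreamComponents(src, tok);
  }

}

Note that the grams are built from the stemmed forms (e.g. "bottl", "cap") because the filter sits after PorterStemFilter, and that longer partial queries need a larger maxGram, which is what the comments below discuss.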
femtoRgon
timo
  • Could you explain a bit what minGram and maxGram do? Basically, how do I know what these values should be for diverse queries? – Katedral Pillon Jun 01 '15 at 17:37
  • When you search for "ottl", you get no results because your index only contains "bottle". But you can add an NGramTokenFilter to your indexing filter chain. This would add all possible parts of each word with at least minGram (e.g. 2) and at most maxGram (e.g. 5) letters. So, for example, with 2 letters: bo, ot, tt ...; with 3 letters: bot, ott, ottl, ...; with 4 ...; and with 5 letters: bottl and ottle (a short demonstration of this expansion follows these comments). – timo Jun 01 '15 at 18:21
  • Then it sounds like maxGram is a problem. I would have to compute maxGram on the fly for each document I add to the index. For example, if a title is "programming language java", then I need maxGram recomputed as 11 so that `programmi` would get a hit. Having to recompute maxGram each time is not scalable. I need to be able to instantiate a writer as `writer = new IndexWriter(index, config)` and use it to add a number of documents. That does not seem possible if I have to specify maxGram each time. **Do you know any other approach?** – Katedral Pillon Jun 01 '15 at 18:49
  • You don't need to recompute maxGram. If you want "all" parts of a word, you can use a very high value for maxGram, e.g. 50, so it would work for all words with up to 51 letters. – timo Jun 01 '15 at 19:41
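
To make the expansion described in the comments concrete, here is a minimal standalone sketch (the class name NGramDemo is made up for illustration; it reuses the StandardTokenizer and NGramTokenFilter constructors already shown above) that prints the grams emitted for the single word "bottle":

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
  public static void main(String[] args) throws Exception {
    // Tokenize the single word "bottle", then expand it into 2- to 5-character grams.
    StandardTokenizer source = new StandardTokenizer(new StringReader("bottle"));
    TokenStream grams = new NGramTokenFilter(source, 2, 5);
    CharTermAttribute term = grams.addAttribute(CharTermAttribute.class);

    grams.reset();
    while (grams.incrementToken()) {
      // Prints every substring of 2 to 5 characters: bo, ot, tt, ..., bottl, ottle
      // (the exact order of emission depends on the Lucene version).
      System.out.println(term.toString());
    }
    grams.end();
    grams.close();
  }
}

With a large maxGram (e.g. the 50 suggested above), the same filter also emits the longer substrings, which is why maxGram does not have to be recomputed per document.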