Java Lucene Ngrams

Question

I want to use the Lucene API to extract n-grams from sentences, but I have run into a peculiar problem. The JavaDoc lists a class called NGramTokenizer. I have downloaded both the 3.6.1 and 4.0 APIs and see no trace of this class. For example, when I try the following, I get an error stating that the symbol NGramTokenizer cannot be found:

NGramTokenizer myTokenizer;

According to the documentation, NGramTokenizer lives at org.apache.lucene.analysis.NGramTokenizer, but I cannot find it anywhere on my computer. A download or similar error seems unlikely, since this happens with both the 3.6.1 and 4.0 APIs.

How can I obtain the NGramTokenizer class?
I added lucene-core-3.6.1.jar to my project.

score 3 · Accepted Answer · answered Nov 10 '12 at 05:53

You are using the wrong jar. NGramTokenizer is not part of lucene-core; it ships in

lucene-analyzers-3.6.1.jar

and its fully qualified name is

org.apache.lucene.analysis.ngram.NGramTokenizer
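
One point worth knowing before adding the jar: NGramTokenizer works on characters, not words, so it splits its input into character n-grams. The dependency-free sketch below only illustrates that idea. The class and method names are hypothetical, it is not Lucene's implementation, and the order in which Lucene emits the grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain-Java character n-grams, no Lucene involved.
final class CharNgramSketch {

    // Hypothetical helper: every substring whose length lies in [minGram, maxGram].
    static List<String> charNgrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            // Slide a window of the current size across the text.
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNgrams("abc", 1, 2)); // prints [a, b, c, ab, bc]
    }
}
```

If word n-grams are what you are after, a tokenizer followed by a shingle filter is likely the closer fit.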

Mawia

score 0 · Answer 2 · answered Oct 24 '15 at 12:05

Here is a utility method I usually use, in case someone needs help with this. It should work with Lucene 4.10 (I didn't test it with lower or higher versions).

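import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private Set<String> generateNgrams(String sentence, int ngramCount) {
    Set<String> ngrams = new HashSet<>();

    // Split the sentence into word tokens.
    StandardTokenizer source = new StandardTokenizer(new StringReader(sentence));
    TokenStream tokenStream = new StandardFilter(source);

    // Unigrams need only the plain word stream. Otherwise wrap it in Lucene's
    // ShingleFilter, which emits word n-grams of up to ngramCount words
    // (by default it also emits the single words).
    if (ngramCount != 1) {
        ShingleFilter shingles = new ShingleFilter(tokenStream);
        shingles.setMaxShingleSize(ngramCount); // must be at least 2
        tokenStream = shingles;
    }

    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    // try-with-resources closes the stream even if reading fails midway.
    try (TokenStream stream = tokenStream) {
        stream.reset();
        while (stream.incrementToken()) {
            ngrams.add(charTermAttribute.toString().toLowerCase());
        }
        stream.end();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return ngrams;
}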

The Maven dependencies required for Lucene are:

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>4.10.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>4.10.3</version>
    </dependency>
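
To sanity-check what generateNgrams returns, it can help to compare it against a dependency-free sketch of word n-grams. The version below is only an approximation of the shingle approach: the class and method names are hypothetical, and splitting on whitespace is a simplifying assumption (StandardTokenizer also strips punctuation, for instance). Like ShingleFilter with its default settings, it keeps the single words and joins multi-word grams with one space.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: plain-Java word n-grams ("shingles"), no Lucene involved.
final class WordNgramSketch {

    // Hypothetical helper: all lower-cased word n-grams of 1..maxSize words.
    static Set<String> wordNgrams(String sentence, int maxSize) {
        Set<String> ngrams = new LinkedHashSet<>();
        String trimmed = sentence.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return ngrams;
        }
        // Assumption: whitespace splitting stands in for a real tokenizer.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the gram one word at a time, starting from words[start].
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    gram.append(' ');
                }
                gram.append(words[start + size - 1]);
                ngrams.add(gram.toString());
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // prints [the, the quick, quick, quick brown, brown, brown fox, fox]
        System.out.println(wordNgrams("The quick brown fox", 2));
    }
}
```

The sketch uses a LinkedHashSet so the printed order is predictable; the Lucene method above returns a HashSet, so only the contents are comparable, not the order.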
