
I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything works fine except one thing: Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
https://stackoverflow.com/a/36237769/462347

My questions:

  1. Why doesn't Lucene filter stop words?

  2. How can I enable stop word filtering in Lucene 5.5 / 6.0?

Mike

2 Answers


Just tested both approach 1 and approach 2, and they both seem to filter out stop words just fine. Here is how I tested it:

public static void main(String[] args) throws IOException, ParseException, org.apache.lucene.queryparser.surround.parser.ParseException 
{
     StandardTokenizer stdToken = new StandardTokenizer();
     stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
     TokenStream tokenStream;

     //Your code starts here
     tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
     tokenStream.reset();
     //And ends here

     CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
     while (tokenStream.incrementToken()) {
         System.out.println(token.toString());
     }
     tokenStream.close();
}

Results:

some
stuff
need
analysis

That eliminates the four stop words in my sample.

femtoRgon
  • The problem is that `Lucene` doesn't filter out such words as `we`, `I` and other common English words. Should I attach other extended stop words dictionary? Does `Lucene` provide other stop words dictionaries? – Mike Mar 27 '16 at 06:46
  • 1
    `EnglishAnalyzer` and `StandardAnalyzer` use the same stopword set, so I don't believe there is a more extensive stop list that comes packaged in lucene. So, yes, you would probably need to create your own. If you are using StandardAnalyzer, it makes it easy to store the stop words in a plain text file, and pass a reader into the constructor. – femtoRgon Mar 27 '16 at 07:04
  • Do you mean `StandardAnalyzer` or `StandardTokenizer`? I use `StandardAnalyzer.STOP_WORDS_SET` but no constructor for `StandardAnalyzer` is used. In contrast, I have `stdToken.setReader(new StringReader(fullText));`. Where exactly should I put my stop words list? – Mike Mar 27 '16 at 07:14
  • 1
    You need to pass a `CharArraySet` of the stop words into your `StopFilter`. `StandardAnalyzer` just has a handy ctor that makes it convenient. To build from a file, you'd want to use [`WordlistLoader.getWordSet`](https://lucene.apache.org/core/5_5_0/analyzers-common/org/apache/lucene/analysis/util/WordlistLoader.html#getWordSet(java.io.Reader)). Or you can just create the [`CharArraySet`](https://lucene.apache.org/core/5_5_0/analyzers-common/org/apache/lucene/analysis/util/CharArraySet.html) yourself, it's pretty straightforward to work with, really. – femtoRgon Mar 27 '16 at 07:22
  • Great, special thanks for the `WordlistLoader.getWordSet`. – Mike Mar 27 '16 at 07:38
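Building on the comments above, here is a minimal sketch of both options: extending the default stop set by hand, and loading a word list through `WordlistLoader.getWordSet`. It assumes Lucene 5.5's analyzers-common module on the classpath (where `CharArraySet` and `WordlistLoader` live in `org.apache.lucene.analysis.util`; in later Lucene versions they moved to `org.apache.lucene.analysis`). The class name and the inline word list are my own for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.WordlistLoader;

public class CustomStopSet {
    public static void main(String[] args) throws IOException {
        // Option 1: start from a mutable copy of the default English set
        // (the default set itself is unmodifiable) and add the extra
        // common words that it lacks, such as "we" and "i".
        CharArraySet stopWords = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet());
        stopWords.add("we");
        stopWords.add("i");

        // Option 2: load a set from a Reader, one word per line.
        // In practice this would be a Reader over your stop words file.
        CharArraySet fromList = WordlistLoader.getWordSet(new StringReader("we\ni\nyou"));

        System.out.println(stopWords.contains("we"));
        System.out.println(fromList.contains("you"));
    }
}
```

Either set can then be passed as the second argument to the `StopFilter` constructor in place of `EnglishAnalyzer.getDefaultStopSet()`.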

The pitfall was Lucene's default stop words list: I expected it to be much broader.

Here is code that first tries to load a customized stop words list and, if that fails, falls back to the standard one:

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%), StandardCharsets.UTF_8);
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
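For reference, `WordlistLoader.getWordSet` treats each line of the reader as one stop word, so the customized file loaded above can be a plain text file with one word per line, for example (contents are my own illustration):

```
a
an
and
i
we
you
the
```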
Mike