
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

public String removeStopWords(String string) throws IOException {

    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}

My main looks like this:

    String file = "..../datatest.txt";

    TestFileReader fr = new TestFileReader();
    fr.imports(file);
    System.out.println(fr.content);

    String text = fr.content;

    Stopwords stopwords = new Stopwords();
    stopwords.removeStopWords(text);
    System.out.println(stopwords.removeStopWords(text));

This is giving me an error but I can't figure out why.

GreatDane
whyname

3 Answers


I had the same problem. To remove stop words with Lucene you can either use its default stop set via EnglishAnalyzer.getDefaultStopSet(), or build your own custom stop-word list.

The code below is a corrected version of your removeStopWords():

public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // required before the first incrementToken()
    while (tokenStream.incrementToken()) {
        sb.append(charTermAttribute.toString()).append(' ');
    }
    tokenStream.end();
    tokenStream.close();
    return sb.toString();
}

To use a custom list of stop words instead, build a CharArraySet:

// CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's built-in set
final List<String> stopWordsList = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stopWordsList, true); // true = ignore case
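Plugged into the same pipeline as the corrected method above, the custom set replaces getDefaultStopSet(). A self-contained sketch against the Lucene 4.8 API (the stop words "fox" and "the" are just the example values from the snippet):

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CustomStopWords {
    public static String removeCustomStopWords(String text) throws Exception {
        // Custom set instead of EnglishAnalyzer.getDefaultStopSet()
        List<String> stopWordsList = Arrays.asList("fox", "the");
        CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stopWordsList, true); // ignore case

        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(text.trim()));
        tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopSet);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // still required before consuming tokens
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term.toString());
        }
        tokenStream.close();
        return sb.toString();
    }
}
```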
user692704
  • what imports are needed to make the above code work? – Junaid Shirwani Sep 10 '15 at 14:11
  • 1
    Here are current, working examples: https://docs.leponceau.org/java-examples/java-evaluation/org.apache.lucene.StandardAnalyzerTest.html and https://docs.leponceau.org/java-examples/java-evaluation/org.apache.lucene.EnglishAnalyzerTest.html – user1050755 Feb 17 '19 at 03:08
  • @user1050755 The linked `EnglishAnalyzer` version works, the `StandardAnalyzer` one doesn't remove any words though, as you probably have to give it a list of stop words. How do you do that? Please also post this code as an answer, as the other code above is outdated and doesn't work with the later versions of Lucene anymore (I'm using 8.6.3). – Neph Oct 12 '20 at 14:48

You may need to call tokenStream.reset() before calling tokenStream.incrementToken(); the TokenStream contract requires it, and skipping it is what causes the error in the question.
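Applied to the question's code, a minimal corrected sketch might look like this (Lucene 4.3 API; note a CharArraySet instead of a HashSet, since the 4.x StopFilter constructor expects one):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class Stopwords {
    public String removeStopWords(String string) throws IOException {
        // CharArraySet instead of HashSet<String>; true = ignore case
        CharArraySet stopWords = new CharArraySet(Version.LUCENE_43,
                Arrays.asList("a", "an", "I", "the"), true);

        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // the missing call
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        tokenStream.close();
        return sb.toString();
    }
}
```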


Lucene's API has changed, so the accepted answer (posted in 2014) no longer compiles. This is a slightly altered version of the code @user1050755 linked that works with Lucene 8.6.3 and Java 8:

final String text = "This is a short test!";
final List<String> stopWords = Arrays.asList("short", "test"); // filters both words
final CharArraySet stopSet = new CharArraySet(stopWords, true); // true = ignore case

try {
    ArrayList<String> remaining = new ArrayList<String>();

    Analyzer analyzer = new StandardAnalyzer(stopSet); // Filters stop words in the given "stopSet"
    //Analyzer analyzer = new StandardAnalyzer(); // Only filters punctuation marks out of the box, you have to provide your own stop words!
    //Analyzer analyzer = new EnglishAnalyzer(); // Filters the default English stop words (see link below)
    //Analyzer analyzer = new EnglishAnalyzer(stopSet); // Only uses the given "stopSet" but also runs a stemmer, so the result might not look like what you expected.
    
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text)); // the field name is arbitrary here
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        remaining.add(term.toString());
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    e.printStackTrace();
}

You can find the default stop words for the EnglishAnalyzer on the official GitHub (here).

The printed results:

  • StandardAnalyzer(stopSet): [this] [is] [a]
  • StandardAnalyzer(): [this] [is] [a] [short] [test]
  • EnglishAnalyzer(): [this] [short] [test]
  • EnglishAnalyzer(stopSet): [thi] [is] [a] (not a typo: the Porter stemmer really reduces this to thi!)

It is possible to combine the default stop words and your own but it's best to use a CustomAnalyzer for that (check out this answer).
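The linked answer uses a CustomAnalyzer for that; a simpler alternative (my sketch, not the linked answer's code) is to merge the default set with your own CharArraySet and hand the result to StandardAnalyzer:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CombinedStopWords {
    public static List<String> tokenize(String text) throws Exception {
        // Copy Lucene's default English stop words, then add the custom ones
        CharArraySet combined = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet());
        combined.addAll(Arrays.asList("short", "test"));

        List<String> remaining = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(combined)) {
            TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                remaining.add(term.toString());
            }
            ts.close();
        }
        return remaining;
    }
}
```

With the example sentence above, both the default words (this, is, a) and the custom ones (short, test) are filtered out. Unlike EnglishAnalyzer, this keeps StandardAnalyzer's behavior, so no stemming is applied.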

Neph