
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

public String removeStopWords(String string) throws IOException {

    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("an");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(token.toString());
        System.out.println(sb);
    }
    return sb.toString();
}

My main looks like this:

    String file = "..../datatest.txt";

    TestFileReader fr = new TestFileReader();
    fr.imports(file);
    System.out.println(fr.content);

    String text = fr.content;

    Stopwords stopwords = new Stopwords();
    stopwords.removeStopWords(text);
    System.out.println(stopwords.removeStopWords(text));

This is giving me an error but I can't figure out why.

GreatDane
whyname

3 Answers


I had the same problem. To remove stop words with Lucene you can either use its default stop set via EnglishAnalyzer.getDefaultStopSet(), or build your own custom stop-word list.

The code below is a corrected version of your removeStopWords():

public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(textFile.trim()));
    tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopWords);

    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset(); // required before the first incrementToken()
    while (tokenStream.incrementToken()) {
        sb.append(charTermAttribute.toString()).append(' ');
    }
    tokenStream.end();
    tokenStream.close();
    return sb.toString();
}

To use a custom list of stop words instead, build a CharArraySet:

// CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet(); // Lucene's built-in set
final List<String> stopWordsList = Arrays.asList("fox", "the");
final CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stopWordsList, true); // true = ignore case
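Plugged into the same pipeline as the corrected method above, the custom set replaces getDefaultStopSet(). A self-contained sketch against the Lucene 4.8 API (the stop words "fox" and "the" are just the example values from the snippet):

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CustomStopWords {
    public static String removeCustomStopWords(String text) throws Exception {
        // Custom set instead of EnglishAnalyzer.getDefaultStopSet()
        List<String> stopWordsList = Arrays.asList("fox", "the");
        CharArraySet stopSet = new CharArraySet(Version.LUCENE_48, stopWordsList, true); // ignore case

        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_48, new StringReader(text.trim()));
        tokenStream = new StopFilter(Version.LUCENE_48, tokenStream, stopSet);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // still required before consuming tokens
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term.toString());
        }
        tokenStream.close();
        return sb.toString();
    }
}
```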
user692704
  • what imports are needed to make the above code work? – Junaid Shirwani Sep 10 '15 at 14:11
  • 1
    Here are current, working examples: https://docs.leponceau.org/java-examples/java-evaluation/org.apache.lucene.StandardAnalyzerTest.html and https://docs.leponceau.org/java-examples/java-evaluation/org.apache.lucene.EnglishAnalyzerTest.html – user1050755 Feb 17 '19 at 03:08
  • @user1050755 The linked `EnglishAnalyzer` version works, the `StandardAnalyzer` one doesn't remove any words though, as you probably have to give it a list of stop words. How do you do that? Please also post this code as an answer, as the other code above is outdated and doesn't work with the later versions of Lucene anymore (I'm using 8.6.3). – Neph Oct 12 '20 at 14:48

You may need to call tokenStream.reset() before calling tokenStream.incrementToken(); the TokenStream contract requires it, and skipping it is what causes the error in the question.
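Applied to the question's code, a minimal corrected sketch might look like this (Lucene 4.3 API; note a CharArraySet instead of a HashSet, since the 4.x StopFilter constructor expects one):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class Stopwords {
    public String removeStopWords(String string) throws IOException {
        // CharArraySet instead of HashSet<String>; true = ignore case
        CharArraySet stopWords = new CharArraySet(Version.LUCENE_43,
                Arrays.asList("a", "an", "I", "the"), true);

        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // the missing call
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        tokenStream.close();
        return sb.toString();
    }
}
```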


Lucene's API has changed, so the accepted answer (posted in 2014) no longer compiles. This is a slightly altered version of the code @user1050755 linked that works with Lucene 8.6.3 and Java 8:

final String text = "This is a short test!";
final List<String> stopWords = Arrays.asList("short", "test"); // filters both words
final CharArraySet stopSet = new CharArraySet(stopWords, true); // true = ignore case

try {
    ArrayList<String> remaining = new ArrayList<String>();

    Analyzer analyzer = new StandardAnalyzer(stopSet); // Filters stop words in the given "stopSet"
    //Analyzer analyzer = new StandardAnalyzer(); // Only filters punctuation marks out of the box, you have to provide your own stop words!
    //Analyzer analyzer = new EnglishAnalyzer(); // Filters the default English stop words (see link below)
    //Analyzer analyzer = new EnglishAnalyzer(stopSet); // Only uses the given "stopSet" but also runs a stemmer, so the result might not look like what you expected.
    
    TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text)); // the field name is arbitrary here
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        remaining.add(term.toString());
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    e.printStackTrace();
}

You can find the default stop words for the EnglishAnalyzer on the official GitHub (here).

The printed results:

  • StandardAnalyzer(stopSet): [this] [is] [a]
  • StandardAnalyzer(): [this] [is] [a] [short] [test]
  • EnglishAnalyzer(): [this] [short] [test]
  • EnglishAnalyzer(stopSet): [thi] [is] [a] (not a typo: the Porter stemmer really reduces this to thi!)

It is possible to combine the default stop words and your own but it's best to use a CustomAnalyzer for that (check out this answer).
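The linked answer uses a CustomAnalyzer for that; a simpler alternative (my sketch, not the linked answer's code) is to merge the default set with your own CharArraySet and hand the result to StandardAnalyzer:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CombinedStopWords {
    public static List<String> tokenize(String text) throws Exception {
        // Copy Lucene's default English stop words, then add the custom ones
        CharArraySet combined = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet());
        combined.addAll(Arrays.asList("short", "test"));

        List<String> remaining = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(combined)) {
            TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                remaining.add(term.toString());
            }
            ts.close();
        }
        return remaining;
    }
}
```

With the example sentence above, both the default words (this, is, a) and the custom ones (short, test) are filtered out. Unlike EnglishAnalyzer, this keeps StandardAnalyzer's behavior, so no stemming is applied.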

Neph