
First I thank anyone who takes the time to help. The internet community is so essential for learning.

Overall goal: I am reading in a .txt file, stemming it using a Java build of the 2003 CIIR KStemmer in Eclipse, and writing the list of stemmed words to a different .txt file.

Easy: inputting the .txt, sorting the .txt into an array of strings or chars, outputting the .txt

Problem: I don't understand how to use the stemmer within my main code.

I have included the CIIR code in a class file (KStemmer.java) and imported the following libraries:

apache-lucene-analyzers.jar

apache-lucene.jar

lucene-analyzers-common-4.2.0.jar

lucene-core-3.4.0.jar

In my main class (StemThis.java) I want to do something like this:

String wordFromTextFile = new String();  // input word
String stemmedWord = new String();      // output word
PrintWriter printer = new PrintWriter("outputFile"); // for file export

KStemmer newStemmer = new KStemmer(); // creating a stemmer
newStemmer.stem(wordFromTextFile);  // stemming a word
stemmedWord = newStemmer.return();  // get stemmed word from stemmer

printer.println(stemmedWord);  // desired output method

This is obviously too simple. Maybe the KStemmer does not work this way. How do I put strings into a KStemmer and get an output?

2 Answers


Please keep in mind that the KStemmer() constructor is declared without an access specifier, so it is package-private. You can't call it from your own code just by importing the org.apache.lucene.analysis.en package.

One solution is to use PorterStemFilter instead, but it is a more aggressive stemmer.

A second solution is to download the source files, include them in your own package, and change the package name.
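To illustrate the access rule, here is a minimal sketch that uses reflection to report the access level of a no-argument constructor. The names AccessDemo, StemmerLike, and describeConstructorAccess are hypothetical and only stand in for a class shaped like KStemmer; this is not Lucene code.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;

public class AccessDemo {

    // Hypothetical stand-in: a public class whose constructor has no access specifier.
    public static class StemmerLike {
        StemmerLike() { }
    }

    // Returns the access level of the no-argument constructor of the given class.
    public static String describeConstructorAccess(Class<?> type) {
        try {
            Constructor<?> constructor = type.getDeclaredConstructor();
            int modifiers = constructor.getModifiers();
            if (Modifier.isPublic(modifiers)) return "public";
            if (Modifier.isProtected(modifiers)) return "protected";
            if (Modifier.isPrivate(modifiers)) return "private";
            // No modifier at all means package-private: callable only from the same package.
            return "package-private";
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("No no-argument constructor on " + type, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeConstructorAccess(StemmerLike.class));
    }
}
```

A constructor reported as package-private can only be invoked by code that lives in the same package, which is why copying the sources into your own package works around the restriction.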

ck reddy

KStemmer doesn't have any public methods, which indicates that it is meant to be called indirectly. The KStemFilter sits next to it in the org.apache.lucene.analysis.en package, so I'm pretty sure the stemmer is meant to be used on a TokenStream (e.g. as part of an analysis chain).

Here is a simple main class that turns a java.io.Reader into a TokenStream, passes it through the KStemFilter and then prints out the tokens.

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class KStemmerTestMain {

    public static void main(String[] args) throws IOException {

        Reader inputReader = new StringReader("I liked to tested that my inputs are stemming fantastically");

        TokenStream whitespaceTokenizer = new WhitespaceTokenizer(Version.LUCENE_42, inputReader);
        TokenStream kStemmedTokenStream = new KStemFilter(whitespaceTokenizer);

        // This attribute is updated in place every time incrementToken() is called.
        CharTermAttribute charTermAttribute = kStemmedTokenStream.addAttribute(CharTermAttribute.class);

        // Many TokenStreams are stateful and must be reset before calling incrementToken()
        kStemmedTokenStream.reset();
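        while (kStemmedTokenStream.incrementToken()) {
            String term = charTermAttribute.toString();
            System.out.print(term + " ");
        }
        // Finish the TokenStream workflow: signal end-of-stream, then release the underlying reader.
        kStemmedTokenStream.end();
        kStemmedTokenStream.close();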
    }
}

Here are the Maven dependencies I used to get the above imports to resolve:

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>4.2.1</version>
    </dependency>
</dependencies>
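To connect this to the file-to-file goal in the question, here is a rough sketch of the surrounding plumbing using only the JDK. The stemming step is passed in as a function, so a wrapper around the KStemFilter loop could be supplied in its place. The names StemFileSketch, stemWords, and stemFile are hypothetical, and the lowercase function in main is only a placeholder, not a stemmer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class StemFileSketch {

    // Splits each line on whitespace and applies the supplied stemmer to every word.
    public static List<String> stemWords(List<String> lines, UnaryOperator<String> stemmer) {
        List<String> stemmed = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    stemmed.add(stemmer.apply(word));
                }
            }
        }
        return stemmed;
    }

    // Reads the input file, stems every word, and writes one stemmed word per line.
    public static void stemFile(Path input, Path output, UnaryOperator<String> stemmer) {
        try {
            List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
            Files.write(output, stemWords(lines, stemmer), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder only: lowercases each word. Swap in a real stemming function here.
        UnaryOperator<String> placeholder = String::toLowerCase;
        System.out.println(stemWords(Arrays.asList("I Liked", "  Tests "), placeholder));
    }
}
```

Keeping the stemmer behind a UnaryOperator<String> means the file handling can be checked on its own before the Lucene dependency is wired in.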
tgray