I'm developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words to their canonical forms). I've already written code that performs stemming. Using it, I can convert the following sentence
The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production
into
stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product
However, I need to get the base forms of the words: "example" instead of "exampl", "produce" instead of "produc", and so on.
I am using Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it has no Russian language support.
So is there any way to do lemmatization for several languages, the way I do stemming with Lucene?
A simplified version of my code responsible for stemming:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.tika.language.LanguageIdentifier;

// Using Apache Tika to identify the language
LanguageIdentifier identifier = new LanguageIdentifier(text);
// Getting an analyzer according to the language (e.g., EnglishAnalyzer for "en")
Analyzer analyzer = getAnalyzer(identifier.getLanguage());
TokenStream stream = analyzer.tokenStream("field", text);
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    String stem = termAtt.toString();
    // doing something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();
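For completeness, the getAnalyzer helper above is just a small dispatch on the language code Tika returns. A minimal sketch (my own naming; assuming only English and Russian are needed, with StandardAnalyzer as a non-stemming fallback) might look like:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerFactory {
    // Maps Tika's ISO 639-1 language codes to Lucene analyzers.
    // The per-language analyzers include a stemming filter in their chain,
    // which is where the truncated forms like "produc" come from.
    static Analyzer getAnalyzer(String languageCode) {
        switch (languageCode) {
            case "en":
                return new EnglishAnalyzer();
            case "ru":
                return new RussianAnalyzer();
            default:
                return new StandardAnalyzer(); // tokenizes but does not stem
        }
    }
}
```

Note that in older Lucene versions (4.x and earlier) these constructors take a Version argument, e.g. new EnglishAnalyzer(Version.LUCENE_47).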
UPDATE: I found a library that does almost what I need (for English and Russian) and uses Apache Lucene (although in its own way); it's definitely worth exploring.