I'm developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words to their canonical forms). I've already written code that performs stemming. Using it, I can convert the following sentence
The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production
into
stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product
However, I need to get the base forms of the words: "example" instead of "exampl", "produce" instead of "produc", and so on.
I am using Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it has no Russian language support.
So is there any way to do lemmatization for several languages, the way I do stemming with Lucene?
A simplified version of my code responsible for stemming:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.tika.language.LanguageIdentifier;

// Using Apache Tika to identify the language
LanguageIdentifier identifier = new LanguageIdentifier(text);
// Getting an analyzer according to the language (e.g., EnglishAnalyzer for "en")
Analyzer analyzer = getAnalyzer(identifier.getLanguage());
TokenStream stream = analyzer.tokenStream("field", text);
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    String stem = termAtt.toString();
    // doing something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();
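For completeness, the getAnalyzer helper above is just a small dispatch on the language code Tika returns. A minimal sketch (my own naming; assuming only English and Russian are needed, with StandardAnalyzer as a non-stemming fallback) might look like:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerFactory {
    // Maps Tika's ISO 639-1 language codes to Lucene analyzers.
    // The per-language analyzers include a stemming filter in their chain,
    // which is where the truncated forms like "produc" come from.
    static Analyzer getAnalyzer(String languageCode) {
        switch (languageCode) {
            case "en":
                return new EnglishAnalyzer();
            case "ru":
                return new RussianAnalyzer();
            default:
                return new StandardAnalyzer(); // tokenizes but does not stem
        }
    }
}
```

Note that in older Lucene versions (4.x and earlier) these constructors take a Version argument, e.g. new EnglishAnalyzer(Version.LUCENE_47).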
UPDATE: I found a library that does almost what I need (for English and Russian) and uses Apache Lucene (although in its own way); it's definitely worth exploring.