
I am using the code below, but the output is not what I expected. For the input "Machine Learning" I get [machine, Learning], but I want [machine, learn]. How can I do this? Also, when the input is "biggest bigger", I would like to get [big, big], but the output is just [biggest, bigger].

(PS: I have only added these four jars to my Eclipse project: joda-time.jar, stanford-corenlp-3.3.1-models.jar, stanford-corenlp-3.3.1.jar, and xom.jar. Do I need to add any more?)

import java.util.LinkedList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class StanfordLemmatizer {

    protected StanfordCoreNLP pipeline;

    public StanfordLemmatizer() {
        // Create StanfordCoreNLP object properties, with POS tagging
        // (required for lemmatization), and lemmatization
        Properties props;
        props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");


        this.pipeline = new StanfordCoreNLP(props);
    }

    public List<String> lemmatize(String documentText)
    {
        List<String> lemmas = new LinkedList<String>();
        // Create an empty Annotation just with the given text
        Annotation document = new Annotation(documentText);
        // run all Annotators on this text
        this.pipeline.annotate(document);
        // Iterate over all of the sentences found
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for(CoreMap sentence: sentences) {
            // Iterate over all tokens in a sentence
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                // Retrieve and add the lemma for each word into the
                // list of lemmas
                lemmas.add(token.get(LemmaAnnotation.class));
            }
        }
        return lemmas;
    }


    // Test
    public static void main(String[] args) {
        System.out.println("Starting Stanford Lemmatizer");
        String text = "Machine Learning\n";
        StanfordLemmatizer slem = new StanfordLemmatizer();
        System.out.println(slem.lemmatize(text));
    }

}
CSnerd

1 Answer


Lemmatization should ideally return the canonical form (known as the 'lemma' or 'headword') of a group of words. This canonical form, however, is not always what we intuitively expect. For example, you expect "learning" to yield the lemma "learn". But the noun "learning" has the lemma "learning", and only the verb form "learning" (the present participle, as in "is learning") has the lemma "learn". In case of ambiguity, the lemmatizer relies on the part-of-speech tag.
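To make the dependence on the part-of-speech tag concrete, here is a minimal toy sketch in plain Java. It is not the Stanford algorithm: the class `PosAwareLemmaSketch` and its `lemma` method are hypothetical names made up for illustration, and the only assumption taken from CoreNLP is that tokens carry Penn Treebank tags such as `NN` (noun) and `VBG` (gerund or present participle).

```java
import java.util.Locale;

// Toy illustration only: NOT how Stanford CoreNLP computes lemmas.
class PosAwareLemmaSketch {

    // Returns a lemma for an "-ing" form that depends on the POS tag.
    // Only the tag VBG triggers stripping; every other tag keeps the word as-is.
    static String lemma(String word, String posTag) {
        String lower = word.toLowerCase(Locale.ROOT);
        if ("VBG".equals(posTag) && lower.endsWith("ing") && lower.length() > 5) {
            // Naive rule: drop "ing". Doubled consonants ("running") and
            // dropped "e" ("making") are deliberately not handled here.
            return lower.substring(0, lower.length() - 3);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(lemma("learning", "VBG")); // verb reading
        System.out.println(lemma("learning", "NN"));  // noun reading
    }
}
```

The same surface string gives two different results depending only on the tag, which is why a wrong or unexpected tag from the tagger leads to an unexpected lemma.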

Well, that explains machine learning, but what about big, bigger and biggest?

Lemmatization depends on morphological analysis. The Stanford Morphology class computes the base form of English words by removing only inflections (not derivational morphology). That is, it handles noun plurals, pronoun case, and verb endings, but not things like comparative adjectives or derived nominals. It is based on a finite-state transducer implemented by John Carroll et al., written in flex. I couldn't find the original version, but a Java version seems to be available here.

That is why biggest will not yield big.

The WordNet lexical database resolves to the correct lemma, though. I have usually used WordNet for lemmatization tasks, and have found no major issues so far. Two other well-known tools that handle your example correctly are:

  1. CST Lemmatizer
  2. MorphAdorner
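If none of these tools can be used, a crude rule-based fallback for regular comparatives and superlatives could look like the sketch below. This is an assumption-laden illustration, not what WordNet, CST Lemmatizer, or MorphAdorner do internally: the class `ComparativeStripperSketch` and its `baseForm` method are hypothetical, and the rules are applied only to tokens tagged `JJR` or `JJS`.

```java
import java.util.Locale;

// Rough fallback sketch: strips regular "-er"/"-est" endings from adjectives.
class ComparativeStripperSketch {

    static String baseForm(String word, String posTag) {
        String w = word.toLowerCase(Locale.ROOT);
        String stem;
        if ("JJR".equals(posTag) && w.endsWith("er")) {
            stem = w.substring(0, w.length() - 2);   // "bigger" -> "bigg"
        } else if ("JJS".equals(posTag) && w.endsWith("est")) {
            stem = w.substring(0, w.length() - 3);   // "biggest" -> "bigg"
        } else {
            return w;                                // not a tagged comparative/superlative
        }
        int n = stem.length();
        if (n >= 3) {
            char last = stem.charAt(n - 1);
            char prev = stem.charAt(n - 2);
            // Undo consonant doubling ("bigg" -> "big"), but keep "ll", "ss", "zz"
            // because they are often part of the base form ("small", "tall").
            if (last == prev && "aeiou".indexOf(last) < 0 && "lsz".indexOf(last) < 0) {
                return stem.substring(0, n - 1);
            }
            // Undo the y -> i spelling change ("happi" -> "happy").
            if (last == 'i') {
                return stem.substring(0, n - 1) + "y";
            }
        }
        return stem;
    }

    public static void main(String[] args) {
        // prints "big big"
        System.out.println(baseForm("biggest", "JJS") + " " + baseForm("bigger", "JJR"));
    }
}
```

Irregular forms ("better", "worst") and adjectives ending in "e" ("larger") are not handled correctly by these rules, so a dictionary-backed tool remains the more reliable choice.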
Chthonic Project
  • I do not know why it shows `[machine, Learning]` and not `[machine, learning]`. Why is the L still uppercase? – CSnerd Apr 16 '14 at 04:16
  • I am guessing that the POS tag for "learning" is "NNP" (proper noun), and that is why it returns a capitalized word. Could you print out the POS tags and check this? – Chthonic Project Apr 16 '14 at 04:21
  • Sorry! I am a newbie here. I do not know how to print the POS tags. Can you tell me how to do this? – CSnerd Apr 16 '14 at 04:24
  • `token.get(PartOfSpeechAnnotation.class)` – Chthonic Project Apr 16 '14 at 04:26
  • It shows `NNP JJS JJR`. I do not know what this means. Do you know what to change so that Learning becomes learning? – CSnerd Apr 16 '14 at 04:32
  • `String#toLowerCase()`. Also, with all due respect ... did you invest any time to understand how the Stanford NLP system works, or did you just expect a copy-pasted code to magically work perfectly? – Chthonic Project Apr 16 '14 at 04:36
  • Sorry, I only briefly read `http://nlp.stanford.edu/software/corenlp.shtml`. Does "String#toLowerCase()" mean that you suggest I just use the Java method, like "Word".toLowerCase()? – CSnerd Apr 16 '14 at 04:41
  • No. I clearly stated the String class. Also, I added a more precise reason to the answer. Stanford lemmatizer will not do what you are looking for. You need to use one of the other options. – Chthonic Project Apr 16 '14 at 04:44
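For readers following the thread: `String#toLowerCase()` is shorthand for the instance method `toLowerCase()` of `java.lang.String`. A minimal sketch of applying it to the list returned by `lemmatize` might look like this; the class `LemmaCaseSketch` and the method `lowerCaseAll` are hypothetical names, and the input list simply reuses the output reported in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Post-processing sketch: lower-cases every lemma in a list.
class LemmaCaseSketch {

    static List<String> lowerCaseAll(List<String> lemmas) {
        List<String> out = new ArrayList<>(lemmas.size());
        for (String lemma : lemmas) {
            // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
            out.add(lemma.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [machine, learning]
        System.out.println(lowerCaseAll(Arrays.asList("machine", "Learning")));
    }
}
```

This only changes the casing. It does not turn "learning" into "learn" or "biggest" into "big", for the reasons given in the answer.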