
I need to build a POS tagger in Java and need to know how to get started. Are there code examples or other resources that help illustrate how POS taggers work?

Stan Murdoch
  • NLP is a hard, unsolved problem. Start by looking for published articles that deal with your problem, pick a few of the suggested solutions, implement them, and keep the one that yields the best results for you. – amit Aug 17 '11 at 06:51
  • Hmm...do you **have** to build your own from scratch? Because if not, you could just use the Stanford one mentioned below by Andrey or the OpenNLP one mentioned by wcolen. Stanford's my preference; it is quite nice. If you have to build one, that sounds like a homework-y project; otherwise there's really no reason to make your own (no reason I can think of at least lol). – dmn Aug 17 '11 at 18:49
  • Creating a POS tagger is a large task. Ideally, you'd get an annotated corpus, parse it, get token frequencies, get likelihood estimates, smooth the data, then build the model. The model could be based on those corpus statistics alone, using something like logit or an HMM, or you could use a feature-based model such as Maxent or a Perceptron. You could also avoid probabilistic models completely by using a rule-based tagger similar to Brill's. – Victor Stoddard Feb 01 '15 at 21:25
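
The pipeline sketched in the last comment (annotated corpus, counts, likelihood estimates, smoothing, model) can be illustrated with a very small bigram HMM. The following is only a sketch under stated assumptions: the class name `ToyHmmTagger`, the three-sentence `TOY_CORPUS`, and the choice of add-one smoothing are all made up for illustration and are not taken from any library. A real tagger would train on a large annotated corpus and handle unknown words far more carefully.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy bigram HMM tagger: counts, add-one smoothing, Viterbi decoding. */
class ToyHmmTagger {
    static final String START = "<s>";
    // Hypothetical hand-made corpus; each sentence is a list of {word, tag} pairs.
    static final String[][][] TOY_CORPUS = {
        {{"the", "DT"}, {"dog", "NN"}, {"barks", "VBZ"}},
        {{"the", "DT"}, {"cat", "NN"}, {"sleeps", "VBZ"}},
        {{"a", "DT"}, {"dog", "NN"}, {"sleeps", "VBZ"}},
    };

    private final List<String> tags = new ArrayList<>();
    private final Set<String> vocab = new HashSet<>();
    private final Map<String, Integer> transitions = new HashMap<>(); // "prev tag" -> count
    private final Map<String, Integer> emissions = new HashMap<>();   // "tag word" -> count
    private final Map<String, Integer> contexts = new HashMap<>();    // prev -> count
    private final Map<String, Integer> tagCounts = new HashMap<>();   // tag -> count

    /** Collects transition and emission counts from the annotated corpus. */
    void train(String[][][] corpus) {
        for (String[][] sentence : corpus) {
            String prev = START;
            for (String[] pair : sentence) {
                String word = pair[0].toLowerCase();
                String tag = pair[1];
                if (!tags.contains(tag)) tags.add(tag);
                vocab.add(word);
                transitions.merge(prev + " " + tag, 1, Integer::sum);
                emissions.merge(tag + " " + word, 1, Integer::sum);
                contexts.merge(prev, 1, Integer::sum);
                tagCounts.merge(tag, 1, Integer::sum);
                prev = tag;
            }
        }
    }

    // Add-one smoothed log P(tag | prev).
    private double logTransition(String prev, String tag) {
        int joint = transitions.getOrDefault(prev + " " + tag, 0);
        return Math.log((joint + 1.0) / (contexts.getOrDefault(prev, 0) + tags.size()));
    }

    // Add-one smoothed log P(word | tag); the extra +1 reserves mass for unseen words.
    private double logEmission(String tag, String word) {
        int joint = emissions.getOrDefault(tag + " " + word.toLowerCase(), 0);
        return Math.log((joint + 1.0) / (tagCounts.getOrDefault(tag, 0) + vocab.size() + 1));
    }

    /** Viterbi decoding: returns the highest-scoring tag sequence for the tokens. */
    String[] tag(String[] words) {
        if (tags.isEmpty()) throw new IllegalStateException("train() must be called first");
        int n = words.length;
        int k = tags.size();
        double[][] best = new double[n][k]; // best[i][j]: best log score ending in tag j at token i
        int[][] back = new int[n][k];       // back[i][j]: previous tag index on that best path
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                if (i == 0) {
                    best[i][j] = logTransition(START, tags.get(j));
                } else {
                    best[i][j] = Double.NEGATIVE_INFINITY;
                    for (int p = 0; p < k; p++) {
                        double score = best[i - 1][p] + logTransition(tags.get(p), tags.get(j));
                        if (score > best[i][j]) {
                            best[i][j] = score;
                            back[i][j] = p;
                        }
                    }
                }
                best[i][j] += logEmission(tags.get(j), words[i]);
            }
        }
        String[] result = new String[n];
        if (n == 0) return result;
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (best[n - 1][j] > best[n - 1][last]) last = j;
        }
        // Follow the back pointers from the last token to the first.
        for (int i = n - 1; i >= 0; i--) {
            result[i] = tags.get(last);
            last = back[i][last];
        }
        return result;
    }

    public static void main(String[] args) {
        ToyHmmTagger tagger = new ToyHmmTagger();
        tagger.train(TOY_CORPUS);
        System.out.println(Arrays.toString(tagger.tag(new String[] {"the", "cat", "barks"})));
    }
}
```

Working in log space avoids numeric underflow on longer sentences, and the smoothing keeps unseen word/tag pairs from receiving zero probability, which is why the sketch can still tag a word it never saw in training.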

3 Answers


Try Apache OpenNLP. It includes a POS tagger tool. You can download ready-to-use English models from here.

The documentation provides details about how to use it from a Java application. Basically you need the following:

Load the POS model

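import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;

// Declared outside the try block so the tagger below can use it
POSModel model = null;

// try-with-resources closes the stream even if loading fails
try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // Model loading failed, handle the error
  e.printStackTrace();
}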

Instantiate the POS tagger

POSTaggerME tagger = new POSTaggerME(model);

Execute it

String[] sent = new String[]{"Most", "large", "cities", "in", "the", "US", "had", "morning", "and", "afternoon", "newspapers", "."};
String[] tags = tagger.tag(sent);

Note that the POS tagger expects a tokenized sentence: one array element per token, with punctuation as separate tokens. Apache OpenNLP also provides tools and models for sentence detection and tokenization.
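
To make the expected input shape concrete, here is a minimal plain-Java sketch that turns a raw sentence into a token array. It is not OpenNLP's tokenizer: the class name `SimpleTokenizerSketch` and the regular expression are assumptions made for illustration, and a trained tokenizer model will handle abbreviations, contractions, and similar cases better.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive regex tokenizer, only meant to show the token array a tagger expects. */
class SimpleTokenizerSketch {
    // A token is a run of letters/digits (optionally joined by ' or -),
    // or any single non-whitespace character such as punctuation.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*|\\S");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String raw = "Most large cities in the US had morning and afternoon newspapers.";
        System.out.println(Arrays.toString(tokenize(raw)));
    }
}
```

For the sentence used in this answer, the sketch yields the same twelve tokens as the hand-written array, with the final period as its own token.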

If you have to train your own model, refer to this documentation.

wcolen

You can examine existing tagger implementations.

Refer, for example, to the Stanford University POS tagger in Java (by Kristina Toutanova). It is available under the GNU General Public License (v2 or later), and the source code is well written and clearly documented:

http://nlp.stanford.edu/software/tagger.shtml

A good book to read about tagging is Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin.

Andrey
  • I'm not sure if the Stanford POS tagger is a good implementation to start from, given its complicated (and one-off) probability model. Jurafsky & Martin is the book to read, though. – Fred Foo Aug 18 '11 at 09:09

A few POS and NER taggers are widely used.

OpenNLP Maxent POS tagger:

OpenNLP is a powerful Java NLP library from Apache. It provides various NLP tools, one of which is a part-of-speech (POS) tagger. POS taggers are usually used to uncover the grammatical structure of text. You start from a tagged dataset in which each word (part of a phrase) is labeled with a tag, build a model from that dataset, and then use the model to generate a tag for each word in new text.
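
As a rough illustration of that "tagged dataset, then model, then tags for new text" workflow, here is a plain-Java baseline that remembers the most frequent tag for each word. It is not how OpenNLP's Maxent tagger works. The class name `MostFrequentTagBaseline`, the four training lines, and the `word_TAG` input layout are assumptions made for this sketch (the layout resembles the style commonly used for POS training files, but check your toolkit's documentation for its exact format).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Baseline tagger: each known word gets its most frequent training tag. */
class MostFrequentTagBaseline {
    private final Map<String, Map<String, Integer>> wordTagCounts = new HashMap<>();
    private final Map<String, Integer> tagTotals = new HashMap<>();

    /** Reads one training sentence in word_TAG form, e.g. "the_DT dog_NN". */
    void addSentence(String line) {
        for (String item : line.trim().split("\\s+")) {
            int sep = item.lastIndexOf('_');
            if (sep <= 0) continue; // skip empty or malformed items
            String word = item.substring(0, sep).toLowerCase();
            String tag = item.substring(sep + 1);
            wordTagCounts.computeIfAbsent(word, w -> new HashMap<>()).merge(tag, 1, Integer::sum);
            tagTotals.merge(tag, 1, Integer::sum);
        }
    }

    private static String mostFrequent(Map<String, Integer> counts) {
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Tags tokens; unknown words fall back to the overall most frequent tag. */
    String[] tag(String[] tokens) {
        if (tagTotals.isEmpty()) throw new IllegalStateException("No training data");
        String fallback = mostFrequent(tagTotals);
        String[] result = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> counts = wordTagCounts.get(tokens[i].toLowerCase());
            result[i] = (counts == null) ? fallback : mostFrequent(counts);
        }
        return result;
    }

    public static void main(String[] args) {
        MostFrequentTagBaseline baseline = new MostFrequentTagBaseline();
        // Hypothetical training lines, made up for this example.
        baseline.addSentence("the_DT dog_NN saw_VBD the_DT cat_NN");
        baseline.addSentence("dogs_NNS run_VBP");
        baseline.addSentence("they_PRP run_VBP home_NN");
        baseline.addSentence("a_DT run_NN helps_VBZ");
        String[] tokens = {"the", "dog", "run", "quickly"};
        System.out.println(Arrays.toString(baseline.tag(tokens)));
    }
}
```

The baseline ignores context entirely: "run" always receives the tag it had most often in training, whether it is used as a noun or a verb. Statistical taggers such as the Maxent model address exactly that limitation by looking at surrounding words and tags.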

Sample code:

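import java.util.Arrays;
import java.util.List;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Naive whitespace tokenization; split once and reuse the tokens
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // Each Sequence is one of the top-scoring candidate taggings of the sentence
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}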

Detailed blog with the full code on how to use it:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php?s=so

Stanford CoreNLP-based NER tagger:

Stanford CoreNLP is one of the most battle-tested NLP libraries out there and is often treated as a gold standard for NLP performance. Among other functionality, the library supports named entity recognition (NER), which tags important entities in a piece of text, such as the names of people and places.

Sample code:

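import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public void doTagging(CRFClassifier<CoreLabel> model, String input) {
  input = input.trim();
  // classifyToString returns the text with an entity label attached to each token
  System.out.println(input + " => " + model.classifyToString(input));
}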

Detailed blog with the full code on how to use it:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so

user439521