
I'm looking for a good open source POS Tagger in Java. Here's what I have come up with so far.

Anybody got any recommendations?

Glenn

3 Answers


Are you looking to tag POS in a specific domain? Most general-purpose taggers are trained on newswire text, and they typically don't perform well in specialized domains (such as biomedical text). There are taggers trained specifically for such domains, such as dTagger (Java) for biomedical text.

For newswire text, Adwait Ratnaparkhi's MXPOST is very good and is the one I would recommend.

Other Java implementations include:

  1. MontyLingua
  2. Berkeley Parser (Not really a POS tagger but all full blown parsers will typically include POS taggers. Google for Java syntactic parsers and you will find many.)
  3. QTag
  4. LBJ

OpenNLP and Lingpipe as posted by the other posters are also pretty decent.

Info on the state-of-the-art on POS tagging can be found here. As you can see LTAG-Spinal (also mentioned by another poster) ranks best as of now, but the variation across the various taggers is not much. I have not used LTAG myself.

Also note that the baseline accuracy for POS tagging is about 90%. Baseline here means: (a) tag every known word with its most frequent POS tag from a lexicon, and (b) tag every unknown word as a noun.
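To make that baseline concrete, here is a minimal sketch in Java. It is not taken from any of the taggers mentioned above; the class name `BaselineTagger`, the toy training data, and the choice of the Penn Treebank tag `NN` as the "noun" fallback are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Most-frequent-tag baseline (illustrative sketch, hypothetical class). */
final class BaselineTagger {
    // Assumption: unknown words get the Penn Treebank singular-noun tag.
    private static final String UNKNOWN_TAG = "NN";

    // Maps each known word to the tag it carried most often in training.
    private final Map<String, String> lexicon = new HashMap<>();

    /** Builds the lexicon from parallel lists of words and their gold tags. */
    BaselineTagger(List<String> words, List<String> tags) {
        // Count how often each tag occurs for each word.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            counts.computeIfAbsent(words.get(i), w -> new HashMap<>())
                  .merge(tags.get(i), 1, Integer::sum);
        }
        // Keep only the most frequent tag per word (ties broken arbitrarily).
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
                if (t.getValue() > bestCount) {
                    best = t.getKey();
                    bestCount = t.getValue();
                }
            }
            lexicon.put(e.getKey(), best);
        }
    }

    /** Tags each token: lexicon lookup for known words, noun otherwise. */
    List<String> tag(List<String> sentence) {
        List<String> out = new ArrayList<>();
        for (String word : sentence) {
            out.add(lexicon.getOrDefault(word, UNKNOWN_TAG));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy training data: "run" is seen once as NN and twice as VB.
        List<String> words = Arrays.asList("the", "dog", "runs", "the", "run", "run", "run");
        List<String> tags  = Arrays.asList("DT",  "NN",  "VBZ",  "DT",  "NN",  "VB",  "VB");
        BaselineTagger tagger = new BaselineTagger(words, tags);
        // "zebra" never appears in training, so it falls back to NN.
        System.out.println(tagger.tag(Arrays.asList("the", "run", "zebra"))); // prints [DT, VB, NN]
    }
}
```

Any tagger you evaluate should beat this comfortably on your own data; if it doesn't, the domain mismatch mentioned above is a likely culprit.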

mayhewsw
hashable
  • Your MXPOST link is to an FTP site with a compressed archive. I searched around and couldn't find much about MXPOST other than it being one guy's CS thesis. Am I correct in assuming that there isn't much community support for MXPOST? – Glenn Feb 27 '10 at 22:09
  • @Glenn Yes. Although OPENNLP seems to be an equivalent implementation of MXPOST. I quote from the OPENNLP site: 1. *If you are familiar with feature selection for Adwait Ratnaparkhi's maxent implementation, you should have no problems since our implementation [of the POS tagger] uses features in the same manner as his.* and 2. *His[Adwait's] introduction to maxent for NLP and dissertation are what really made opennlp.maxent and our Grok maxent components (POS tagger, end of sentence detector, tokenizer, name finder) possible!* OpenNLP appears to have an active sourceforge community. – hashable Mar 01 '10 at 08:57
  • In the end, it was LingPipe that worked out the best for me. It was the best in terms of being able to easily embed within another system. It did a pretty good job at POS tagging too. – Glenn May 22 '11 at 20:44

I have used OpenNLP with good results. You can also check out MorphAdorner.

Shashikant Kore

I've used both LingPipe and Stanford's POS Tagger. The latter is a state-of-the-art POS tagger but, in my experience, it is too slow (although they do provide less accurate models that are reasonably fast). Of course, it always depends on what you are trying to achieve, and there will always be a trade-off between speed and accuracy.

I also once used LBJ-based NER software and, although it was pretty accurate, the source code was a complete mess. Both LingPipe's and Stanford's sources are very clean and well documented.

You can also take a look at LTAG-spinal. I haven't used it yet, but from the algorithm description, and from the listed accuracy, it sure seems better than the alternatives you have so far.

Hope it helps.

João Silva
  • Stanford's best model is moderately slow. But, actually, LTAG-spinal is 3 times slower again and insignificantly better. For general purpose use, we recommend the left3words model: tagging with it is of similar or better speed than with Ratnaparkhi's or the OpenNLP tagger but is more accurate than either. Find [more info](http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h) in the Stanford POS tagger FAQ. – Christopher Manning Sep 03 '10 at 03:18
  • I cannot find any comparison with OpenNlp there (only with other taggers) - am I overlooking something? – benroth Dec 09 '13 at 20:02
  • @ChristopherManning I just did a 10-fold cross-validation using Penn Treebank. It seems that left3words is slightly worse than OpenNLP, but Bidirectional is indeed better. Could you tell more about the data on which you did the comparison? Thanks! – Wei Qiu Jul 24 '14 at 15:38
  • @benroth: Fair enough, OpenNLP doesn't appear in that FAQ comment. – Christopher Manning Jul 25 '14 at 18:39
  • @Wei Qiu: This comment was based on experiments I ran comparing various taggers in 2010. I just looked up the results. I was using opennlp-1.4.3 (the then-current, pre-Apache release). At that time, opennlp's accuracy (maxent model) on Penn Treebank WSJ sections 22-24 was trivially worse than the Stanford POS tagger's (96.80% vs. 96.87%), but it was considerably slower (10.71 seconds vs. 6.92 seconds). I haven't repeated this exercise recently. – Christopher Manning Jul 25 '14 at 18:47
  • @all: note that Stanford's tagger uses a GPL license, meaning that most likely you will need to open up your code base. Why no Apache-style license, Stanford? – U Avalos Jul 07 '16 at 05:40