Python NLTK pos_tag not returning the correct part-of-speech tag

Question

Having this:

import nltk
text = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

Using their example sentence "John's big idea isn't all that bad." located here: http://www.nltk.org/api/nltk.tag.html I see it does a great job and finds the adjectives. Why is that? Where do you suggest I look for another tagger? — faceoff, Jun 13 '15 at 17:13
These are unambiguous words and should have been tagged correctly. Try some other tagger for better results. — Riyaz, Jun 13 '15 at 17:13
@faceoff I don't know of any tagger for this task; I just suggested looking for one in the docs! — Mazdak, Jun 13 '15 at 17:15
You can use the HMM tagger in NLTK. I have my own implementation of an HMM tagger, which correctly predicted the tags for your example sentence. I suppose it should work for you as well. — Riyaz, Jun 13 '15 at 17:19

score 70 · Accepted Answer · edited May 23 '17 at 12:34

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It now uses the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It's better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

HunPos
Stanford POS
Senna

Using the default MaxEnt POS tagger from NLTK (before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> # Newer NLTK releases name this class StanfordPOSTagger.
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
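
The outputs quoted in this answer can be compared token by token. The sketch below is plain Python with no NLTK calls. The tag lists are copied from the outputs shown in this answer, while `GOLD` is a hand-written reference tagging (an assumption, chosen to match what the question expects) and `mismatches` and `accuracy` are hypothetical helper names.

```python
# Reference tagging: an assumption, written by hand to match what the
# question expects (quick/brown/lazy as JJ, jumps as VBZ).
TOKENS = "The quick brown fox jumps over the lazy dog".split()
GOLD = ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]

# Tag sequences copied from the outputs quoted in this answer.
OUTPUTS = {
    "maxent (old default)": ["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"],
    "perceptron (new default)": ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "stanford": ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "hunpos": ["DT", "JJ", "JJ", "NN", "NNS", "IN", "DT", "JJ", "NN"],
    "senna": ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
}


def mismatches(predicted, gold=GOLD, tokens=TOKENS):
    """Return (token, predicted_tag, gold_tag) for every disagreement."""
    return [(tok, p, g) for tok, p, g in zip(tokens, predicted, gold) if p != g]


def accuracy(predicted, gold=GOLD):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


for name, tags in OUTPUTS.items():
    print(f"{name:26s} {accuracy(tags):.2f} {mismatches(tags)}")
```

A single sentence says little about overall accuracy; a real comparison would need a held-out annotated corpus.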

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
Affix/Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
Build Your Own Brill (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
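
Simple taggers like the Ngram and Affix/Regex ones listed above can be chained, so that each falls back to the next when it has no answer. The following is a minimal pure-Python sketch of that backoff idea, not NLTK's implementation (NLTK ships its own `UnigramTagger`, `RegexpTagger` and `DefaultTagger` classes, which accept a `backoff` argument). The lexicon, the suffix patterns and the function names are illustrative assumptions.

```python
import re

# Toy lexicon (an assumption): in practice this is learned from a tagged corpus.
LEXICON = {
    "the": "DT", "quick": "JJ", "brown": "JJ", "lazy": "JJ",
    "fox": "NN", "dog": "NN", "over": "IN",
}

# Suffix rules tried in order when the lexicon has no entry.
# These patterns are rough illustrations, not a tuned rule set.
SUFFIX_RULES = [
    (re.compile(r".*ing$"), "VBG"),
    (re.compile(r".*ed$"), "VBD"),
    (re.compile(r".*ly$"), "RB"),
    (re.compile(r".*s$"), "NNS"),
]

DEFAULT_TAG = "NN"  # last resort when nothing else matches


def tag_word(word):
    """Tag one word: lexicon lookup, then suffix rules, then the default."""
    tag = LEXICON.get(word.lower())
    if tag is not None:
        return tag
    for pattern, suffix_tag in SUFFIX_RULES:
        if pattern.match(word):
            return suffix_tag
    return DEFAULT_TAG


def tag(tokens):
    """Tag a list of tokens, returning (word, tag) pairs."""
    return [(tok, tag_word(tok)) for tok in tokens]


print(tag("The quick brown fox jumps over the lazy dog".split()))
```

Because the suffix rule sees no context, this sketch tags "jumps" as NNS, the same mistake some of the taggers above make. Telling a plural noun from a third-person verb needs the surrounding words, which is what the context-aware models add.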

Complaints about pos_tag accuracy on Stack Overflow include:

Issues about NLTK HunPos include:

Issues with NLTK and Stanford POS tagger include:

Yeah yeah, no model is perfect, but this example is pretty disappointing. Considering all the technology that went into this "recommended" tagger, it's not unreasonable to expect more. — alexis, Jun 13 '15 at 22:52
It has been 3 years since the model was updated; possibly we should raise this on the `nltk-dev` Google group: https://github.com/arne-cl/nltk-maxent-pos-tagger. And the model was created 7 years ago =( https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L84 — alvas, Jun 13 '15 at 23:49
By the look of it, `Stanford` and `Senna` are superior taggers, aren't they? — Houman, Aug 06 '15 at 08:14
Yes, the Stanford and Senna taggers are more complex, and both groups put a lot of effort into building their tools. — alvas, Aug 06 '15 at 08:21
@alvas Thank you for the amazing answer! It's still (sadly) pretty relevant in 2017 as I have been working with NLTK in the past few months — tech4242, Aug 18 '17 at 10:11
@tech4242, exactly. Given a larger annotated corpus, it might be possible to reach better tagging accuracy. — alvas, Aug 18 '17 at 10:16
Hmm... in one sentence it correctly tagged "change" as a verb, whereas in another sentence it incorrectly tagged "change" as a noun! Bizarre. — Shan, Nov 18 '17 at 16:00

score 2 · Answer 2 · answered Oct 28 '19 at 06:29

Solutions such as changing to the Stanford or Senna or HunPOS tagger will definitely yield results, but here is a much simpler way to experiment with different taggers that are also included within NLTK.

The default POS tagger in NLTK right now is the averaged perceptron tagger. Here's a function that will opt to use the Maxent Treebank Tagger instead:

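import nltk

def treebankTag(text):
    # Tokenize, then tag with the older Maxent Treebank model instead of the default.
    # The pickle must be available locally (the 'maxent_treebank_pos_tagger' data package).
    words = nltk.word_tokenize(text)
    treebankTagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    return treebankTagger.tag(words)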

I have found that the pre-trained averaged perceptron tagger in NLTK is biased toward treating some adjectives as nouns, as in your example. The treebank tagger has gotten more adjectives correct for me.

Interesting, but with this model, in "The quick brown fox jumps over the lazy dog." the word "jumps" is tagged as a noun, not a verb. — luky, Oct 08 '21 at 08:40

score 1 · Answer 3 · edited May 05 '21 at 16:21

import nltk

def tagPOS(textcontent, taggedtextcontent, defined_tags):
    # Tag the raw text with NLTK's default tagger.
    token = nltk.word_tokenize(textcontent)
    nltk_pos_tags = nltk.pos_tag(token)

    # Tag the same tokens from a word-to-tag dictionary; unknown words get None.
    unigram_pos_tag = nltk.UnigramTagger(model=defined_tags).tag(token)

    # Parse pre-tagged "word/TAG" items into (word, tag) tuples.
    tagged_pos_tag = [nltk.tag.str2tuple(word) for word in taggedtextcontent.split()]

    return (nltk_pos_tags, tagged_pos_tag, unigram_pos_tag)
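
The function above comes with no explanation, so here is a hedged reading of its inputs: `textcontent` appears to be raw text, `taggedtextcontent` a string of `word/TAG` items, and `defined_tags` a word-to-tag dictionary. The sketch below mimics, in plain Python, what the dictionary lookup and the `word/TAG` parsing return. `split_tagged` and `lookup_tags` are hypothetical stand-ins written for illustration, not NLTK functions, and the sample inputs are made up.

```python
def split_tagged(item, sep="/"):
    """Split 'word/TAG' on the last separator and upper-case the tag,
    which is my understanding of what nltk.tag.str2tuple does."""
    loc = item.rfind(sep)
    if loc < 0:
        return (item, None)
    return (item[:loc], item[loc + len(sep):].upper())


def lookup_tags(tokens, model):
    """Tag each token from a dictionary; unknown words get None,
    as a unigram tagger without a backoff would give."""
    return [(tok, model.get(tok)) for tok in tokens]


# Made-up sample inputs.
tagged_text = "The/DT quick/JJ brown/JJ fox/NN"
defined_tags = {"quick": "JJ", "fox": "NN"}

print([split_tagged(item) for item in tagged_text.split()])
print(lookup_tags(["The", "quick", "brown", "fox"], defined_tags))
```

Words missing from the dictionary come back with a tag of `None`, so comparing the three returned lists shows at a glance where a hand-built lexicon has gaps.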
