36

Having this:

text = word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ') , ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

dmcc
faceoff
  • 1
    Using their example sentence "John's big idea isn't all that bad." (from http://www.nltk.org/api/nltk.tag.html), I see it does a great job and finds the adjectives. Why is that? Where do you suggest looking for another tagger? – faceoff Jun 13 '15 at 17:13
  • These are unambiguous words and should have been tagged correctly. Try some other tagger for better results. – Riyaz Jun 13 '15 at 17:13
  • @faceoff I don't know of any tagger for this task; I just suggested looking for one in the docs! – Mazdak Jun 13 '15 at 17:15
  • You can use the HMM tagger in NLTK. I have my own implementation of an HMM tagger, which correctly predicted the tags for your example sentence. I suppose it should work for you as well. – Riyaz Jun 13 '15 at 17:19

3 Answers

70

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It is now the perceptron tagger from @Honnibal's implementation, see nltk.tag.pos_tag

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger) 

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If someone wants a TL;DR solution, see https://github.com/alvations/nltk_cli


In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

  • HunPos
  • Stanford POS
  • Senna

Using the default MaxEnt POS tagger from NLTK (the default before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1 not UTF8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
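
To compare the outputs above at a glance, here is a minimal sketch in plain Python (no NLTK required) that lists where each tagger disagrees with the expected tags. The expected sequence is an assumption: JJ for quick, brown and lazy as the question asks, plus VBZ for jumps, which the Stanford and Senna outputs agree on. The names `EXPECTED`, `OUTPUTS` and `disagreements` are hypothetical helpers, not part of NLTK.

```python
WORDS = "The quick brown fox jumps over the lazy dog".split()

# Assumed gold tags (see the note above); not an official annotation.
EXPECTED = "DT JJ JJ NN VBZ IN DT JJ NN".split()

# Tag sequences copied from the outputs shown in this answer.
OUTPUTS = {
    "maxent (old default)": "DT NN NN NN NNS IN DT NN NN".split(),
    "perceptron (NLTK 3.1+)": "DT JJ NN NN VBZ IN DT JJ NN".split(),
    "stanford": "DT JJ JJ NN VBZ IN DT JJ NN".split(),
    "hunpos": "DT JJ JJ NN NNS IN DT JJ NN".split(),
    "senna": "DT JJ JJ NN VBZ IN DT JJ NN".split(),
}


def disagreements(predicted, expected=EXPECTED, words=WORDS):
    """Return (word, predicted_tag, expected_tag) for every mismatch."""
    return [(w, p, e) for w, p, e in zip(words, predicted, expected) if p != e]


for name, tags in OUTPUTS.items():
    wrong = disagreements(tags)
    correct = len(WORDS) - len(wrong)
    print(f"{name}: {correct}/{len(WORDS)} correct, mismatches: {wrong}")
```

A single nine-word sentence says very little about overall accuracy, so this is only a way to read the outputs side by side, not a benchmark.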

Or try building a better POS tagger:


Complaints about pos_tag accuracy on Stack Overflow include:

Issues about NLTK HunPos include:

Issues with NLTK and Stanford POS tagger include:

Community
alvas
  • 6
    Yeah yeah, no model is perfect, but this example is pretty disappointing. Considering all the technology that went into this "recommended" tagger, it's not unreasonable to expect more. – alexis Jun 13 '15 at 22:52
  • Nice demo of the alternatives, though. – alexis Jun 13 '15 at 22:53
  • It has been 3 years since the model was updated; possibly we should raise this with the `nltk-dev` Google group: https://github.com/arne-cl/nltk-maxent-pos-tagger. And the model was created 7 years ago =( https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L84 – alvas Jun 13 '15 at 23:49
  • By the look of it, `Stanford` and `Senna` are superior taggers, aren't they? – Houman Aug 06 '15 at 08:14
  • 1
    Yes, the Stanford and Senna taggers are more complicated, and both groups put a lot of effort into building their tools. – alvas Aug 06 '15 at 08:21
  • 1
    @alvas Thank you for the amazing answer! It's still (sadly) pretty relevant in 2017 as I have been working with NLTK in the past few months – tech4242 Aug 18 '17 at 10:11
  • @tech4242, exactly. Given a larger annotated corpus, it might be possible to reach better tagger accuracy. – alvas Aug 18 '17 at 10:16
  • Hmm, in one sentence it correctly tagged "change" as a verb, whereas in another sentence it incorrectly tagged "change" as a noun! Bizarre. – Shan Nov 18 '17 at 16:00
2

Solutions such as changing to the Stanford or Senna or HunPOS tagger will definitely yield results, but here is a much simpler way to experiment with different taggers that are also included within NLTK.

The default POS tagger in NLTK right now is the averaged perceptron tagger. Here's a function that will opt to use the Maxent Treebank Tagger instead:

import nltk

def treebankTag(text):
    # Tokenize, then tag with the older Maxent Treebank model instead of the default.
    words = nltk.word_tokenize(text)
    treebankTagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    return treebankTagger.tag(words)

I have found that the pre-trained averaged perceptron tagger in NLTK is biased toward treating some adjectives as nouns, as in your example. The treebank tagger has gotten more adjectives correct for me.

  • 2
    Interesting, but with this model, in "The quick brown fox jumps over the lazy dog." the word "jumps" is tagged as a noun, not a verb. – luky Oct 08 '21 at 08:40
1
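import nltk

def tagPOS(textcontent, taggedtextcontent, defined_tags):
    # Tag the raw text with NLTK's default tagger.
    token = nltk.word_tokenize(textcontent)
    nltk_pos_tags = nltk.pos_tag(token)

    # Tag the same tokens by lookup in defined_tags (a word -> tag dict);
    # words missing from the dict are tagged None.
    unigram_pos_tag = nltk.UnigramTagger(model=defined_tags).tag(token)

    # Parse pre-tagged text of the form "word/TAG word/TAG ...".
    tagged_pos_tag = [nltk.tag.str2tuple(word) for word in taggedtextcontent.split()]

    return (nltk_pos_tags, tagged_pos_tag, unigram_pos_tag)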
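
For readers without NLTK at hand, the two lookup steps in the function above can be approximated in plain Python. This is a rough sketch of the behaviour, not NLTK's implementation: `lookup_tag` and `split_tagged` are hypothetical names, and the `defined_tags` dictionary is made-up example data.

```python
def lookup_tag(tokens, model):
    """Tag each token by dictionary lookup; unknown tokens get None."""
    return [(tok, model.get(tok)) for tok in tokens]


def split_tagged(tagged_text, sep="/"):
    """Parse 'word/TAG word/TAG ...' into (word, TAG) tuples.

    Splits each item on its last separator and upper-cases the tag;
    an item without a separator is returned with a tag of None.
    """
    pairs = []
    for item in tagged_text.split():
        word, found, tag = item.rpartition(sep)
        pairs.append((word, tag.upper()) if found else (item, None))
    return pairs


# Hypothetical word -> tag model for illustration only.
defined_tags = {"quick": "JJ", "brown": "JJ", "lazy": "JJ"}

print(lookup_tag("The quick brown fox".split(), defined_tags))
print(split_tagged("The/DT quick/JJ fox/NN"))
```

The lookup approach only fixes words you list explicitly, which is why it leaves every other token untagged unless it is combined with another tagger.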
Alex Metsai