
I have a serious problem: I have downloaded the latest version of NLTK and I am getting strange POS output:

import nltk

sample_text = "start please with me"
tokenized = nltk.sent_tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""Chank___Start:{<VB|VBZ>*}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)

[out]:

(S start/JJ please/NN with/IN me/PRP)

I do not know why "start" is tagged as JJ and "please" as NN. Why is that?

Micha agus
    The issue is probably caused by the improper English used in the sentence. NLTK is effective with proper English, but grammatically incorrect sentences will cause problems. `start please with me` is a sentence fragment. I also suspect there are more errors in your code, because I tried this sentence on the NLTK POS tagger here: http://textanalysisonline.com/nltk-pos-tagging and it worked just fine: 'start|NN please|NN with|IN me|PRP' – Rob Mar 02 '16 at 02:15
    Use a different POS tagger? Perhaps the default one is not so robust to malformed English. – Tomer Levinboim Mar 02 '16 at 02:15
  • How do I use another one? – Micha agus Mar 02 '16 at 02:24
  • What is wrong with my sentence? – Micha agus Mar 02 '16 at 02:25
  • Possible duplicate of [Python NLTK pos\_tag not returning the correct part-of-speech tag](http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag) – alvas Mar 02 '16 at 02:27

1 Answer

The default NLTK `pos_tag` has somehow learnt that *please* is a noun, which is almost never correct in proper English, e.g.

>>> from nltk import pos_tag
>>> pos_tag('Please go away !'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please'.split())
[('Please', 'VB')]
>>> pos_tag('please'.split())
[('please', 'NN')]
>>> pos_tag('please !'.split())
[('please', 'NN'), ('!', '.')]
>>> pos_tag('Please !'.split())
[('Please', 'NN'), ('!', '.')]
>>> pos_tag('Would you please go away ?'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
>>> pos_tag('Would you please go away !'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please go away ?'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]

Using WordNet as a benchmark, there shouldn't be a case where please is a noun.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('please')
[Synset('please.v.01'), Synset('please.v.02'), Synset('please.v.03'), Synset('please.r.01')]

But I think this is largely due to the text which was used to train the PerceptronTagger rather than the implementation of the tagger itself.

Now, if we take a look at what's inside the pre-trained PerceptronTagger, we see that its tag dictionary explicitly knows only 1500+ words:

>>> from nltk import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> tagger.tagdict['I']
'PRP'
>>> tagger.tagdict['You']
'PRP'
>>> tagger.tagdict['start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'start'
>>> tagger.tagdict['Start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Start'
>>> tagger.tagdict['please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'please'
>>> tagger.tagdict['Please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Please'
>>> len(tagger.tagdict)
1549

One trick you can do is to hack the tagger:

>>> tagger.tagdict['start'] = 'VB'
>>> tagger.tagdict['please'] = 'VB'
>>> tagger.tag('please start with me'.split())
[('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]
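
Note that hacking `tagger.tagdict` only affects that one tagger instance. An alternative that doesn't touch NLTK's internals is to post-process the tagger's output. This is just a sketch; the `OVERRIDES` table and the `apply_overrides` helper are my own names, not part of NLTK:

```python
# Sketch: force known-bad tags after tagging, instead of editing tagger.tagdict.
# OVERRIDES maps lowercased words to the tag you want to force.
OVERRIDES = {'please': 'VB', 'start': 'VB'}

def apply_overrides(tagged, overrides=OVERRIDES):
    """Replace the tag of any word found in `overrides` (case-insensitive)."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]

tagged = [('start', 'JJ'), ('please', 'NN'), ('with', 'IN'), ('me', 'PRP')]
print(apply_overrides(tagged))
# → [('start', 'VB'), ('please', 'VB'), ('with', 'IN'), ('me', 'PRP')]
```

This keeps the correction local to your pipeline, so other code using the same tagger is unaffected.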

But the most logical thing to do is to simply retrain the tagger, see http://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.train
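
Retraining can be sketched roughly as follows. The three-sentence corpus below is a toy placeholder purely for illustration; a real retraining run would use a large tagged corpus such as `nltk.corpus.treebank.tagged_sents()`:

```python
# Sketch: retraining NLTK's PerceptronTagger from scratch on a tagged corpus.
# The tiny training set here is a made-up placeholder; use a real corpus in practice.
from nltk.tag.perceptron import PerceptronTagger

train_sents = [
    [('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')],
    [('please', 'VB'), ('go', 'VB'), ('away', 'RB')],
    [('start', 'VB'), ('the', 'DT'), ('engine', 'NN')],
]

tagger = PerceptronTagger(load=False)  # don't load the pre-trained model
tagger.train(train_sents, nr_iter=5)   # pass save_loc='model.pickle' to persist

print(tagger.tag('please start with me'.split()))
```

With a realistic corpus, the retrained model should tag *please* and *start* as verbs in imperative contexts.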


And if you don't want to retrain a tagger, then see [Python NLTK pos_tag not returning the correct part-of-speech tag](http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag)

Most probably, using the StanfordPOSTagger gets you what you need:

>>> from nltk import StanfordPOSTagger
>>> sjar = '/home/alvas/stanford-postagger/stanford-postagger.jar'
>>> m = '/home/alvas/stanford-postagger/models/english-left3words-distsim.tagger'
>>> spos_tag = StanfordPOSTagger(m, sjar)
>>> spos_tag.tag('Please go away !'.split())
[(u'Please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Please'.split())
[(u'Please', u'VB')]
>>> spos_tag.tag('Please !'.split())
[(u'Please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please !'.split())
[(u'please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please'.split())
[(u'please', u'VB')]
>>> spos_tag.tag('Would you please go away !'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Would you please go away ?'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'?', u'.')]

For Linux: See https://gist.github.com/alvations/e1df0ba227e542955a8a

For Windows: See https://gist.github.com/alvations/0ed8641d7d2e1941b9f9

alvas