
I have a serious problem: I have downloaded the latest version of NLTK and I am getting strange POS output:

import nltk

sample_text = "start please with me"
tokenized = nltk.sent_tokenize(sample_text)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""Chank___Start:{<VB|VBZ>*}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)

[out]:

(S start/JJ please/NN with/IN me/PRP)

I do not know why "start" is tagged as JJ and "please" as NN. Why is that?

Micha agus
    The issue is probably caused by the improper English used in the sentence. NLTK is effective with proper English, but grammatically incorrect sentences will cause problems. `start please with me` is a sentence fragment. I also suspect there are more errors in your code, because I tried this sentence on the NLTK POS tagger here: http://textanalysisonline.com/nltk-pos-tagging and it worked just fine: 'start|NN please|NN with|IN me|PRP' – Rob Mar 02 '16 at 02:15
    Use a different POS tagger? Perhaps the default one is not so robust to malformed English. – Tomer Levinboim Mar 02 '16 at 02:15
  • How do I use another one? – Micha agus Mar 02 '16 at 02:24
  • What is wrong with my sentence? – Micha agus Mar 02 '16 at 02:25
  • Possible duplicate of [Python NLTK pos\_tag not returning the correct part-of-speech tag](http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag) – alvas Mar 02 '16 at 02:27

1 Answer

The default NLTK `pos_tag` has somehow learnt that *please* is a noun, which is almost never correct in proper English, e.g.

>>> from nltk import pos_tag
>>> pos_tag('Please go away !'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please'.split())
[('Please', 'VB')]
>>> pos_tag('please'.split())
[('please', 'NN')]
>>> pos_tag('please !'.split())
[('please', 'NN'), ('!', '.')]
>>> pos_tag('Please !'.split())
[('Please', 'NN'), ('!', '.')]
>>> pos_tag('Would you please go away ?'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
>>> pos_tag('Would you please go away !'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please go away ?'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]

Using WordNet as a benchmark, there shouldn't be a case where please is a noun.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('please')
[Synset('please.v.01'), Synset('please.v.02'), Synset('please.v.03'), Synset('please.r.01')]

But I think this is largely due to the text which was used to train the PerceptronTagger rather than the implementation of the tagger itself.

Now, if we take a look at what's inside the pre-trained PerceptronTagger, we see that its tag dictionary explicitly knows only 1500+ words:

>>> from nltk import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> tagger.tagdict['I']
'PRP'
>>> tagger.tagdict['You']
'PRP'
>>> tagger.tagdict['start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'start'
>>> tagger.tagdict['Start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Start'
>>> tagger.tagdict['please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'please'
>>> tagger.tagdict['Please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Please'
>>> len(tagger.tagdict)
1549

One trick you can do is to hack the tagger:

>>> tagger.tagdict['start'] = 'VB'
>>> tagger.tagdict['please'] = 'VB'
>>> tagger.tag('please start with me'.split())
[('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]
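
Note that hacking `tagger.tagdict` only affects that one tagger instance. An alternative that doesn't touch NLTK's internals is to post-process the tagger's output. This is just a sketch; the `OVERRIDES` table and the `apply_overrides` helper are my own names, not part of NLTK:

```python
# Sketch: force known-bad tags after tagging, instead of editing tagger.tagdict.
# OVERRIDES maps lowercased words to the tag you want to force.
OVERRIDES = {'please': 'VB', 'start': 'VB'}

def apply_overrides(tagged, overrides=OVERRIDES):
    """Replace the tag of any word found in `overrides` (case-insensitive)."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]

tagged = [('start', 'JJ'), ('please', 'NN'), ('with', 'IN'), ('me', 'PRP')]
print(apply_overrides(tagged))
# → [('start', 'VB'), ('please', 'VB'), ('with', 'IN'), ('me', 'PRP')]
```

This keeps the correction local to your pipeline, so other code using the same tagger is unaffected.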

But the most logical thing to do is to simply retrain the tagger, see http://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.train
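
Retraining can be sketched roughly as follows. The three-sentence corpus below is a toy placeholder purely for illustration; a real retraining run would use a large tagged corpus such as `nltk.corpus.treebank.tagged_sents()`:

```python
# Sketch: retraining NLTK's PerceptronTagger from scratch on a tagged corpus.
# The tiny training set here is a made-up placeholder; use a real corpus in practice.
from nltk.tag.perceptron import PerceptronTagger

train_sents = [
    [('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')],
    [('please', 'VB'), ('go', 'VB'), ('away', 'RB')],
    [('start', 'VB'), ('the', 'DT'), ('engine', 'NN')],
]

tagger = PerceptronTagger(load=False)  # don't load the pre-trained model
tagger.train(train_sents, nr_iter=5)   # pass save_loc='model.pickle' to persist

print(tagger.tag('please start with me'.split()))
```

With a realistic corpus, the retrained model should tag *please* and *start* as verbs in imperative contexts.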


And if you don't want to retrain a tagger, then see [Python NLTK pos_tag not returning the correct part-of-speech tag](http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag)

Most probably, using the StanfordPOSTagger gets you what you need:

>>> from nltk import StanfordPOSTagger
>>> sjar = '/home/alvas/stanford-postagger/stanford-postagger.jar'
>>> m = '/home/alvas/stanford-postagger/models/english-left3words-distsim.tagger'
>>> spos_tag = StanfordPOSTagger(m, sjar)
>>> spos_tag.tag('Please go away !'.split())
[(u'Please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Please'.split())
[(u'Please', u'VB')]
>>> spos_tag.tag('Please !'.split())
[(u'Please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please !'.split())
[(u'please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please'.split())
[(u'please', u'VB')]
>>> spos_tag.tag('Would you please go away !'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Would you please go away ?'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'?', u'.')]

For Linux: See https://gist.github.com/alvations/e1df0ba227e542955a8a

For Windows: See https://gist.github.com/alvations/0ed8641d7d2e1941b9f9

alvas