explaining NLTK pos_tag ugly mistakes

Question

I am doing text mining with Python 3 and NLTK. In the preprocessing step I want to implement noun-phrase chunking, which requires POS tagging followed by selection according to a regexp grammar. The results were not satisfying. For example, I got the phrase "the president of" as a chunk because pos_tag had tagged it as DT, NN, NNP (how on earth could "of" be a proper noun?).

So I delved into NLTK's pos_tag and found this peculiar result.

With

import nltk
nltk.help.upenn_tagset()

you get the complete list of tags that NLTK's pos_tag can produce. This is one of them:

CD: numeral, cardinal : mid-1890 nine-thirty forty-two ...

As stated, the word mid-1890 is a CD. Now this is what I got:

from nltk import pos_tag

t1 = 'it happened in the mid-1890s'
pos_tag(t1.split())[-1]  # gives: ('mid-1890s', 'NNS')

t2 = 'it happened in the mid-1890'
pos_tag(t2.split())[-1]  # gives: ('mid-1890', 'NN')

t3 = 'mid-1890'
pos_tag(t3.split())[-1]  # gives: ('mid-1890', 'NN')

t4 = 'mid-1890s'
pos_tag(t4.split())[-1]  # gives: ('mid-1890s', 'NNS')

Isn't this odd?
Is there any (probably supervised) method for improving POS tagging? I'm working with over 11,000 documents (up to 500 words each).

Nope it's not odd, https://explosion.ai/blog/part-of-speech-pos-tagger-in-python. — alvas, Mar 27 '17 at 08:40
Statistical POS taggers do all sorts of strange things. There's usually no point in asking "why", because while you could fix the particular example that bothered you, there's no good way to improve performance over all -- except by switching to a better tagger. So yes, it's a duplicate of the question pointed out by alvas. — alexis, Mar 29 '17 at 17:28
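To illustrate the "fix the particular example" route mentioned in the comment above, here is a minimal sketch of a rule-based post-correction step. It is not part of NLTK: the names `OVERRIDES` and `patch_tags` are hypothetical, the two rules are assumptions chosen only to match the cases in the question, and the input is the hand-typed tagger output reported there. As the comment says, this kind of patch does not improve the tagger overall.

```python
import re

# Hypothetical override rules, NOT part of NLTK: (pattern, replacement tag).
# The first rule whose pattern matches the whole token wins.
OVERRIDES = [
    (re.compile(r"^of$", re.IGNORECASE), "IN"),  # "of" is a preposition
    (re.compile(r"^mid-\d{4}$"), "CD"),          # e.g. mid-1890, per the tagset help
]

def patch_tags(tagged, overrides=OVERRIDES):
    """Return a new list of (token, tag) pairs with override rules applied."""
    patched = []
    for token, tag in tagged:
        for pattern, new_tag in overrides:
            if pattern.match(token):
                tag = new_tag
                break
        patched.append((token, tag))
    return patched

# Tags as reported in the question, typed in by hand (no tagger is run here).
tagged = [("the", "DT"), ("president", "NN"), ("of", "NNP")]
print(patch_tags(tagged))
```

In real use the input would presumably be the list returned by `pos_tag(tokens)`, and the patched list would then be passed on to the chunker.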
