I am doing text mining with Python 3 and NLTK. In the preprocessing step I want to implement noun-phrase chunking, which requires POS tagging followed by selection according to a regexp grammar. My results were not satisfying. For example, I got the phrase "the president of" as a chunk, because pos_tag had tagged it as DT, NN, NNP (how on earth could "of" be a proper noun?!).
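To make the failure concrete, here is a minimal pure-Python sketch of how such a mis-tag ends up inside a chunk. It assumes a grammar of the form NP: <DT>?<JJ>*<NN.*>+, which is a hypothetical stand-in and not necessarily the grammar used here; chunk_noun_phrases is an illustrative helper, not an NLTK function.

```python
def chunk_noun_phrases(tagged):
    """Greedy chunker for the assumed grammar NP: <DT>?<JJ>*<NN.*>+.

    `tagged` is a list of (token, tag) pairs, as returned by pos_tag.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if j < n and tagged[j][1] == "DT":
            j += 1
        # Any number of adjectives.
        while j < n and tagged[j][1] == "JJ":
            j += 1
        # One or more nouns (NN, NNS, NNP, NNPS all start with "NN").
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            chunks.append(" ".join(token for token, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


# Tags as reported by pos_tag in my case: "of" is wrongly tagged NNP.
print(chunk_noun_phrases([("the", "DT"), ("president", "NN"), ("of", "NNP")]))
# -> ['the president of']

# With the expected preposition tag IN, the chunk stops at the noun.
print(chunk_noun_phrases([("the", "DT"), ("president", "NN"), ("of", "IN")]))
# -> ['the president']
```

So the chunker itself behaves as specified; the bad chunk comes entirely from the wrong tag.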
So I delved into NLTK's pos_tag and found this peculiar result.
With
import nltk
nltk.help.upenn_tagset()
you'll get a complete list of the tags used by NLTK's pos_tag. This is one of them:
CD: numeral, cardinal : mid-1890 nine-thirty forty-two ...
As stated, the word mid-1890 is a CD. Now this is what I got:
from nltk import pos_tag
t1 = 'it happened in the mid-1890s'
pos_tag(t1.split())[-1] # gives:('mid-1890s', 'NNS')
t2 = 'it happened in the mid-1890'
pos_tag(t2.split())[-1] # gives:('mid-1890', 'NN')
t3 = 'mid-1890'
pos_tag(t3.split())[-1] # gives:('mid-1890', 'NN')
t4 = 'mid-1890s'
pos_tag(t4.split())[-1] # gives:('mid-1890s', 'NNS')
Isn't this situation odd?!
Is there any (probably supervised) method for improving POS tagging? I'm working with over 11,000 documents (up to 500 words each).
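For illustration, the kind of correction in question could be approximated with rule-based post-processing of the tagger output. The sketch below is only that, under stated assumptions: OVERRIDES and correct_tags are hypothetical names, and the two rules are hand-written guesses derived from the examples above, not something NLTK provides. It would not scale to errors that cannot be listed by hand, which is why a supervised method is the real question.

```python
import re

# Hypothetical override rules: (pattern on the token, tag to force).
# The first rule mirrors the CD example from the tagset help;
# the second forces the preposition tag IN for "of".
OVERRIDES = [
    (re.compile(r"^mid-\d{4}$"), "CD"),
    (re.compile(r"^of$", re.IGNORECASE), "IN"),
]


def correct_tags(tagged):
    """Return a copy of `tagged` with the first matching override applied."""
    corrected = []
    for token, tag in tagged:
        for pattern, new_tag in OVERRIDES:
            if pattern.match(token):
                tag = new_tag
                break
        corrected.append((token, tag))
    return corrected


# Input tags are the ones pos_tag returned in the examples above.
print(correct_tags([("mid-1890", "NN"), ("of", "NNP"), ("president", "NN")]))
# -> [('mid-1890', 'CD'), ('of', 'IN'), ('president', 'NN')]
```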