I am trying to tokenize text, run the POS tagger on it, and then chunk the tagger's output using a custom pattern (see below). These are my imports, followed by the POS-tagged output:
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
pos = [(u'max', 'NN'), (u'workpiece', 'NN'), (u'diameter', 'NN'), (u'250', 'CD'), (u'mm', 'NN'), (u'threading', 'VBG'), (u'length', 'NN'), (u'800', 'CD'), (u'mm', 'NN'), (u'max', 'NN'), (u'module', 'NN'), (u'5', 'CD'), (u'total', 'NN'), (u'power', 'NN'), (u'requirement', 'NN'), (u'5', 'CD'), (u'kW', 'NNP')]
I am trying to tweak the chunker I've created, which uses the following pattern:
pattern = r"""
ATTR: {<NN><NN>+}
{<VBG><NN>}
VAL: {<CD><NN|NNP>}
"""
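For reference, a minimal sketch of how this pattern is presumably applied. The post does not show that step, so the use of `RegexpParser` and the names `parser`, `tree` and `labels` are assumptions. The first label is written as ATTR so that it matches the output shown in this post.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# POS-tagged tokens copied from the post (u'' prefixes dropped).
pos = [('max', 'NN'), ('workpiece', 'NN'), ('diameter', 'NN'), ('250', 'CD'),
       ('mm', 'NN'), ('threading', 'VBG'), ('length', 'NN'), ('800', 'CD'),
       ('mm', 'NN'), ('max', 'NN'), ('module', 'NN'), ('5', 'CD'),
       ('total', 'NN'), ('power', 'NN'), ('requirement', 'NN'), ('5', 'CD'),
       ('kW', 'NNP')]

# Two stages: ATTR (with two rules) is applied first, then VAL.
pattern = r"""
ATTR: {<NN><NN>+}
      {<VBG><NN>}
VAL: {<CD><NN|NNP>}
"""

parser = RegexpParser(pattern)  # assumed entry point, not shown in the post
tree = parser.parse(pos)
print(tree)

# Top-level view: chunk label for subtrees, the bare word for unchunked tokens.
labels = [node.label() if isinstance(node, Tree) else node[0] for node in tree]
print(labels)
```

The stages run in the order they are written, so by the time the VAL stage runs, any NN already grouped into an ATTR chunk is no longer available to it.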
My current output:
(S
(ATTR max/NN workpiece/NN diameter/NN)
(VAL 250/CD mm/NN)
(ATTR threading/VBG length/NN)
800/CD
(ATTR mm/NN max/NN module/NN)
5/CD
(ATTR total/NN power/NN requirement/NN)
(VAL 5/CD kW/NNP))
My required output:
(S
(ATTR max/NN workpiece/NN diameter/NN)
(VAL 250/CD mm/NN)
(ATTR threading/VBG length/NN)
(VAL 800/CD mm/NN)
(ATTR max/NN module/NN)
5/CD
(ATTR total/NN power/NN requirement/NN)
(VAL 5/CD kW/NNP))
How can I customize this chunking pattern so that 800/CD mm/NN is also recognized as a VAL? I thought my VAL rule meant: find one token tagged CD followed by one token tagged NN. What approach should I take to achieve this?
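A minimal sketch to check that reading of the VAL rule in isolation, assuming the pattern is applied with `RegexpParser` (the names `val_only` and `val_chunks` are only illustrative):

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# Same POS-tagged tokens as above.
pos = [('max', 'NN'), ('workpiece', 'NN'), ('diameter', 'NN'), ('250', 'CD'),
       ('mm', 'NN'), ('threading', 'VBG'), ('length', 'NN'), ('800', 'CD'),
       ('mm', 'NN'), ('max', 'NN'), ('module', 'NN'), ('5', 'CD'),
       ('total', 'NN'), ('power', 'NN'), ('requirement', 'NN'), ('5', 'CD'),
       ('kW', 'NNP')]

# Only the VAL stage, with no earlier stage competing for the NN tokens.
val_only = RegexpParser(r"VAL: {<CD><NN|NNP>}")
tree = val_only.parse(pos)

# Collect the words of every VAL chunk found.
val_chunks = [" ".join(word for word, tag in node.leaves())
              for node in tree if isinstance(node, Tree)]
print(val_chunks)
```

Run on its own, the rule does chunk 800/CD mm/NN, so the rule itself seems to say what I intended. It also chunks 5/CD total/NN, which I do not want, so simply applying VAL before the other stage would probably not give the required output either.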
Thanks