Python NLTK and Regexp

Question

I am trying to tokenize text, run the POS tagger, and then chunk its output using a custom pattern (see below). These are my imports, followed by the POS-tagged output.

from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *

pos =  [(u'max', 'NN'), (u'workpiece', 'NN'), (u'diameter', 'NN'), (u'250', 'CD'), (u'mm', 'NN'), (u'threading', 'VBG'), (u'length', 'NN'), (u'800', 'CD'), (u'mm', 'NN'), (u'max', 'NN'), (u'module', 'NN'), (u'5', 'CD'), (u'total', 'NN'), (u'power', 'NN'), (u'requirement', 'NN'), (u'5', 'CD'), (u'kW', 'NNP')]

I am trying to tweak the POS chunker I've created in the following way:

pattern = r""" 
          FEAT: {<NN><NN>+}
                {<VBG><NN>}
           VAL: {<CD><NN|NNP>}
           """

My current output:

(S
  (ATTR max/NN workpiece/NN diameter/NN)
  (VAL 250/CD mm/NN)
  (ATTR threading/VBG length/NN)
  800/CD
  (ATTR mm/NN max/NN module/NN)
  5/CD
  (ATTR total/NN power/NN requirement/NN)
  (VAL 5/CD kW/NNP)

My required output:

(S
  (ATTR max/NN workpiece/NN diameter/NN)
  (VAL 250/CD mm/NN)
  (ATTR threading/VBG length/NN)
  (VAL 800/CD mm/NN)
  (ATTR max/NN module/NN)
  5/CD
  (ATTR total/NN power/NN requirement/NN)
  (VAL 5/CD kW/NNP)

How can I customize this chunking pattern so that 800 (CD) mm (NN) is also treated as a VAL? I thought my VAL rule says: find one token tagged CD followed by a token tagged NN. What approach should I take to achieve this?

Thanks

score 1 · Answer 1 · answered Jun 02 '15 at 13:38

I'm not sure I understand exactly what you are after; it would help if you formatted your example more cleanly and explained what you are actually doing with your pattern variable. My guess would be to make the NN|NNP part optional, something like this:

import nltk

pos = [('max', 'NN'), ('workpiece', 'NN'), ('diameter', 'NN'), ('250', 'CD'), ('mm', 'NN'), ('threading', 'VBG'), ('length', 'NN'), ('5', 'CD'), ('800', 'CD'), ('mm', 'NN'), ('max', 'NN'), ('module', 'NN')]

pattern = r"""
        FEAT: {<NN><NN>+}
        {<VBG><NN>}
        VAL: {<CD><NN|NNP>?}
        """

parser = nltk.RegexpParser(pattern)
print(parser.parse(pos))

Output:

(S
  (FEAT max/NN workpiece/NN diameter/NN)
  (VAL 250/CD mm/NN)
  (FEAT threading/VBG length/NN)
  (VAL 5/CD)
  (VAL 800/CD)
  (FEAT mm/NN max/NN module/NN))

Try reversing the order of your grammar/chunker rules (so that the VAL one comes first). The NLTK parser for this is quite trigger-happy and doesn't allow for multiple parse trees, so it will take the first match. (Comment by Igor, Jun 03 '15 at 16:17)
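
The reordering suggested in the comment can be sketched roughly as follows. This is only an illustration using the tagged tokens from the question: the VAL stage is listed before FEAT, so `800/CD mm/NN` is chunked before the noun-sequence rule can absorb `mm`. The names `tree` and `val_chunks` are introduced here for inspection and are not part of the original post.

```python
import nltk

# POS-tagged tokens copied from the question.
pos = [('max', 'NN'), ('workpiece', 'NN'), ('diameter', 'NN'),
       ('250', 'CD'), ('mm', 'NN'), ('threading', 'VBG'), ('length', 'NN'),
       ('800', 'CD'), ('mm', 'NN'), ('max', 'NN'), ('module', 'NN'),
       ('5', 'CD'), ('total', 'NN'), ('power', 'NN'), ('requirement', 'NN'),
       ('5', 'CD'), ('kW', 'NNP')]

# Same rules as in the question, but with the VAL stage first so that it
# runs before the FEAT rules get a chance to claim 'mm'.
pattern = r"""
    VAL:  {<CD><NN|NNP>}
    FEAT: {<NN><NN>+}
          {<VBG><NN>}
"""

parser = nltk.RegexpParser(pattern)
tree = parser.parse(pos)
print(tree)

# Collect the tokens of every VAL chunk so they are easy to check.
val_chunks = [st.leaves() for st in tree.subtrees(lambda t: t.label() == 'VAL')]
print(val_chunks)
```

One caveat with this ordering: the VAL rule matches any CD followed by NN, so on this data it should also chunk `5/CD total/NN` as a VAL, which differs from the required output in the question. Keeping that `5/CD` unchunked would probably need a narrower VAL rule (for example one restricted to known unit tokens), which is beyond what the rule as written can express with tags alone.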
