
I am looking for a way, given an English text, to count the verb phrases in it by tense: past, present and future. For now I am using NLTK to do POS (part-of-speech) tagging and then count, say, 'VBD' tags to get past tenses. This is not accurate enough though, so I guess I need to go further and use chunking, then analyze the VP chunks for specific tense patterns. Is there anything existing that does that? Any further reading that might be helpful? The NLTK book focuses mostly on NP chunks, and I can find very little information on VP chunks.
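
Roughly, what I am doing now is something like this (a minimal sketch of the tag-counting approach described above; the sample sentence is made up):

    import nltk

    # Requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages.
    text = "I walked to the store, and I will buy milk tomorrow."
    tagged = nltk.pos_tag(nltk.word_tokenize(text))

    # Count simple past-tense verbs by their POS tag alone.
    past_count = sum(1 for word, tag in tagged if tag == 'VBD')
    print(past_count)  # 1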

Michael Pliskin

2 Answers


The exact answer depends on which chunker you intend to use, but list comprehensions will take you a long way. This gets you the number of verb phrases, using a non-existent chunker:

len([phrase for phrase in nltk.Chunker(sentence) if phrase[1] == 'VP'])

You can take a more fine-grained approach to count the individual tenses.
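
For example, here is a minimal sketch using NLTK's RegexpParser as the chunker; the VP grammar and the tag-based tense rules are only illustrative assumptions, not a complete tense detector:

    import nltk

    # Chunk grammar: an optional modal followed by one or more verb forms.
    chunker = nltk.RegexpParser(r"VP: {<MD>?<VB.*>+}")

    def tense_counts(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        counts = {'past': 0, 'present': 0, 'future': 0}
        for vp in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == 'VP'):
            words = [w.lower() for w, t in vp.leaves()]
            tags = [t for w, t in vp.leaves()]
            if 'will' in words or 'shall' in words:        # crude future marker
                counts['future'] += 1
            elif any(t in ('VBD', 'VBN') for t in tags):   # simple past or participle
                counts['past'] += 1
            else:                                          # everything else as present
                counts['present'] += 1
        return counts

    print(tense_counts("He walked home, she will drive, and they are waiting."))
    # {'past': 1, 'present': 1, 'future': 1}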

Tim McNamara
  • Thanks for the pointer, that's what I am gonna use - my next question is whether there is something existing to detect tense patterns. For each VP I'd like to know what tense it is in. – Michael Pliskin Aug 09 '10 at 10:55
  • I actually managed to solve my problem with this approach, so I am marking this as the accepted answer. The following article is really helpful: http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/ – Michael Pliskin Aug 16 '10 at 12:46
  • Hi Michael, great to hear that things are working well for you! – Tim McNamara Aug 17 '10 at 00:04

You can do this with either the Berkeley Parser or Stanford Parser. But I don't know if there's a Python interface available for either.
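
Newer NLTK releases do ship a wrapper under nltk.parse.stanford (since deprecated in favor of the CoreNLP interface). Assuming that wrapper, a local copy of the parser jars, and Java on the path, a rough sketch might look like this; the jar paths below are placeholders:

    from nltk.parse.stanford import StanfordParser

    # Placeholder paths: point these at your local Stanford Parser download.
    parser = StanfordParser(path_to_jar='stanford-parser.jar',
                            path_to_models_jar='stanford-parser-models.jar')

    for tree in parser.raw_parse("She has been writing all morning."):
        # Count the VP constituents in the parse tree.
        vps = list(tree.subtrees(filter=lambda t: t.label() == 'VP'))
        print(len(vps), [' '.join(vp.leaves()) for vp in vps])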

ars
  • Thanks a lot, this might be an option - however, as I am heavily using NLTK already, it might be quite a lot of work to switch. Will look, though. – Michael Pliskin Aug 09 '10 at 10:59
  • There is an interface for the Stanford Parser in the NLTK. You can use it as follows: `tagger = nltk.tag.stanford.POSTagger('models/german-fast.tagger', 'stanford-postagger.jar')` You may have to encode the strings to UTF-8 first (at least for the German model). – Suzana Mar 21 '13 at 16:44