
The Stanford POS Tagger FAQ (http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h) claims the tagger can tag about 15,000 words per second. However, I'm getting about 7 words per second. I'm using english-left3words-distsim.tagger, as the docs recommend. Am I doing something wrong, or is this a result of running the tagger through the nltk library?

from nltk import word_tokenize
from nltk.tag import StanfordPOSTagger
jar = '/Users/marie/Desktop/StandfordParser/stanford-postagger-2015-12-09/stanford-postagger.jar'
model = '/Users/marie/Desktop/StandfordParser/stanford-postagger-2015-12-09/models/english-left3words-distsim.tagger'
tagger = StanfordPOSTagger(model, jar)

tokens = word_tokenize("What's the airspeed of an unladen swallow ?")

%timeit tagger.tag(tokens)

1 loop, best of 3: 1.01 s per loop
marie
  • There are many overheads when you are calling Stanford tools through NLTK (for now, until https://github.com/nltk/nltk/pull/1249 is merged). See also http://stackoverflow.com/a/23322996/610569 – alvas Sep 30 '16 at 23:58
  • You're discounting start-up costs. Call it with 15,000 tokens (with `.tag_sents()`) and see how long it takes. – alexis Oct 01 '16 at 08:41
  • Thanks for your help! It looks like 5000 sentences take about 5 seconds to tag using tag_sents(). – marie Oct 03 '16 at 16:57
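
To illustrate the point made in the comments, here is a minimal sketch of why one `tag_sents()` call over many sentences is so much faster than one `tag()` call per sentence. `StubTagger` is a hypothetical stand-in, not part of NLTK: it only simulates a fixed per-call start-up cost (such as launching a Java process) so the sketch runs without Java or the Stanford jar. The helper names and the 0.01 s delay are assumptions for illustration. With the real `StanfordPOSTagger`, the same `tag_batched` helper should apply, since it only relies on `tag_sents()`.

```python
import time


class StubTagger:
    """Hypothetical stand-in for StanfordPOSTagger so this runs without Java.

    Every call pays a fixed simulated start-up cost (assumed value).
    """

    def __init__(self, startup_seconds=0.01):
        self.startup_seconds = startup_seconds
        self.calls = 0  # number of simulated tagger launches

    def tag_sents(self, sentences):
        self.calls += 1
        time.sleep(self.startup_seconds)  # simulated per-call overhead
        # Dummy tag "X" for every token; a real tagger returns POS tags.
        return [[(token, "X") for token in sent] for sent in sentences]

    def tag(self, tokens):
        return self.tag_sents([tokens])[0]


def tag_one_by_one(tagger, sentences):
    """One tagger call per sentence: pays the start-up cost every time."""
    return [tagger.tag(sent) for sent in sentences]


def tag_batched(tagger, sentences):
    """A single tag_sents() call: pays the start-up cost once."""
    return tagger.tag_sents(sentences)


def words_per_second(tag_fn, tagger, sentences):
    """Measure throughput of tag_fn over the given tokenized sentences."""
    n_words = sum(len(sent) for sent in sentences)
    start = time.perf_counter()
    tag_fn(tagger, sentences)
    return n_words / (time.perf_counter() - start)


# 50 copies of the already-tokenized example sentence from the question.
sentences = [["What", "'s", "the", "airspeed", "of", "an", "unladen", "swallow", "?"]] * 50

slow = words_per_second(tag_one_by_one, StubTagger(), sentences)
fast = words_per_second(tag_batched, StubTagger(), sentences)
print(f"one call per sentence: {slow:.0f} words/s")
print(f"single tag_sents call: {fast:.0f} words/s")
```

The absolute numbers printed here mean nothing for the real tagger; only the gap between the two strategies is the point.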
