
I'm processing a large number of documents using Stanford's CoreNLP library alongside the Stanford CoreNLP Python Wrapper. I'm using the following annotators:

tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref

along with the shift-reduce parser model englishSR.ser.gz. I'm mainly using CoreNLP for its co-reference resolution / named entity recognition, and as far as I'm aware I'm using the minimal set of annotators for this purpose.
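For reference, the configuration amounts to roughly the following properties (a sketch only; the `parse.model` path is an assumption based on the standard stanford-srparser models jar, so adjust it for your install):

```python
# Sketch of the properties used by the pipeline described above.
# The parse.model path is an assumption (standard stanford-srparser jar layout).
CORENLP_PROPS = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    "outputFormat": "json",
}
```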

What steps can I take to speed up the annotation of these documents?

The other SO answers all suggest not loading the models for every document, but I'm already avoiding that: the wrapper starts the server once and then passes documents and results back and forth.

The documents I am processing have an average length of 20 sentences, with some as long as 400 sentences and some as short as 1. The average parse time per sentence is 1 second. I can parse ~2500 documents per day with one single-threaded process running on one machine, but I'd like to double that (if not more).

Ayrton Massey
  • any update on this? – Anish Nov 13 '16 at 04:02
  • @Ngeunpo I stopped working on this project a long time ago, but I remember reading some advice on the CoreNLP website that different parser models can speed up your processing time. Try different models and see which one is fastest for you; the default one is *really* slow. You can also [reduce the range that dcoref checks](https://stanfordnlp.github.io/CoreNLP/coref.html#statistical-system) for mentions (see coref.maxMentionDistance) – Ayrton Massey Nov 22 '16 at 15:14
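A rough sketch of the overrides suggested in that comment, assuming the statistical coref annotator described on the linked page (the question itself uses dcoref, whose settings differ); the mention-distance value is purely illustrative:

```python
# Hypothetical speed-oriented property overrides, following the comment above.
# Assumes the statistical "coref" annotator; values are illustrative only.
FAST_PROPS = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,coref",
    # Shift-reduce parser: usually much faster than the default PCFG model.
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
    # Limit how far back coref looks for candidate mentions (illustrative value).
    "coref.maxMentionDistance": "50",
}
```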

2 Answers


Try setting up the Stanford CoreNLP server rather than loading the annotators on each run. That way you load the annotators once and process documents a lot faster. The first request will be slower, but the rest are much faster. See the Stanford CoreNLP server documentation for more details.
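A minimal sketch of that setup, assuming a locally running server on port 9000 and the question's annotator list; the HTTP call follows the server's documented properties/text interface:

```python
# Minimal sketch: one long-running CoreNLP server, models loaded once.
# Assumes the server was started separately, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumed host/port
PROPS = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "outputFormat": "json",
}

def annotate(text):
    """POST one document to the running server and return its JSON annotation."""
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(PROPS)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    doc = annotate("Stanford University is located in California. It is a private university.")
    print(len(doc["sentences"]), "sentences annotated")
```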

Having said that, there is often a trade-off between accuracy and speed, so you may want to do due diligence with other tools like NLTK and spaCy to see what works best for you.


One thing to note is that sentence length has a very large impact on parsing time for some parts of the CoreNLP library. I would recommend not trying to parse sentences that are more than 100 tokens long.

One way to approach this is to run two different pipelines: a tokenizer / sentence splitter, and the full pipeline. The splitter pipeline determines how long each sentence is, and you can then decide how to mitigate overly long ones (e.g. by ignoring the sentence, or splitting it into multiple sentences). The full pipeline then only runs on documents / sentences that are shorter than the maximum length you allow.
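A sketch of that two-pass idea over the server's HTTP interface; the host/port, the 100-token cap, and the skip-the-document policy are assumptions for illustration:

```python
# Two-pipeline sketch: a cheap tokenize/ssplit pass to measure sentence
# lengths, then the full pipeline only for documents under the length cap.
# Host/port, threshold, and skip policy are illustrative assumptions.
import json
import requests

CORENLP_URL = "http://localhost:9000"
MAX_TOKENS = 100  # don't fully parse documents containing longer sentences

LIGHT_PROPS = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
FULL_PROPS = {
    "annotators": "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref",
    "outputFormat": "json",
}

def annotate(text, props):
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(props)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()

def process(document):
    # First pass: sentence splitting only, to check sentence lengths.
    light = annotate(document, LIGHT_PROPS)
    longest = max((len(s["tokens"]) for s in light["sentences"]), default=0)
    if longest > MAX_TOKENS:
        # Mitigate however you prefer: skip the document, drop the long
        # sentences, or split them before re-submitting.
        return None
    # Second pass: the full (expensive) pipeline, only on short-sentence docs.
    return annotate(document, FULL_PROPS)
```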

While this approach doesn't speed up the average case, it can massively improve worst-case performance. The trade-off is that there are legitimate sentences that are longer than you would expect.

David