I'm processing a large number of documents with Stanford's CoreNLP library via the Stanford CoreNLP Python Wrapper. I'm using the following annotators:
tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref
along with the shift-reduce parser model englishSR.ser.gz. I'm mainly using CoreNLP for its coreference resolution and named entity recognition, and as far as I know this is the minimal set of annotators for that purpose.
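For reference, a pipeline like this usually boils down to a properties mapping passed to the server on each request. This is only a sketch (the exact wrapper API and the model path inside the jar are assumptions on my part, not something specific to my setup):

```python
# Sketch of the annotation properties behind the pipeline described above.
# The "parse.model" path is the conventional location of the shift-reduce
# model inside the CoreNLP models jar; adjust it to match your install.
ANNOTATORS = "tokenize,ssplit,pos,lemma,ner,entitymentions,parse,dcoref"

props = {
    "annotators": ANNOTATORS,
    # Shift-reduce parser model (much faster than the default PCFG parser)
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.ser.gz",
}
```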
What can I do to speed up the annotation of these documents?
Other SO answers suggest not reloading the models for every document, but I'm already avoiding that: the wrapper starts the server once and then passes documents and results back and forth.
The documents I'm processing average 20 sentences each, ranging from 1 sentence up to about 400. The average parse time is 1 second per sentence. With a single single-threaded process on one machine I can parse ~2500 documents per day, but I'd like to at least double that.
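As a sanity check on those numbers, here is the back-of-the-envelope arithmetic (using the figures stated above):

```python
# Rough throughput check for the numbers quoted in the question.
docs_per_day = 2500
avg_sentences_per_doc = 20
sec_per_sentence = 1.0

sentences_per_day = docs_per_day * avg_sentences_per_doc
busy_hours = sentences_per_day * sec_per_sentence / 3600

# ~50,000 sentences/day at 1 s each is ~13.9 hours of compute,
# so doubling throughput means either halving per-sentence time
# or running a second worker in parallel.
```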