
I have to process hundreds of thousands of texts, and I have found that the thing taking the longest is the following:

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [...]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
...
# This line takes longer than I would like
doc = nlp(whole_chat)

Granted, I have many patterns, but is there a way to speed this up? I only have the EntityRuler pipe, no others.

formicaman
  • For anyone coming here, there's now an official speed FAQ for spaCy with the advice from the answers here and more. https://github.com/explosion/spaCy/discussions/8402 – polm23 Jan 07 '22 at 05:11

2 Answers


By default, spaCy applies several models to your document: a POS tagger, a syntactic parser, NER, and possibly others such as a text categorizer.

Maybe you do not need some of these models. If that is the case, you can disable them, which will speed up your pipeline. You do this when loading the pipeline, like this:

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

Or, following @oleg-ivanytskyi's answer, you can disable these models in the nlp.pipe() call:

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
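
Since the question's pipeline contains only an EntityRuler, the two tips combine naturally: start from a blank English pipeline (no tagger, parser, or NER to disable in the first place) and stream the texts through nlp.pipe(). A minimal sketch using the spaCy 3.x API, where add_pipe() takes the registered component name; the single pattern here is just an illustration standing in for the asker's many patterns:

```python
import spacy

# Blank English pipeline: tokenizer only, nothing else to slow things down
nlp = spacy.blank("en")

# In spaCy 3.x, add_pipe() takes the component's registered string name
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])

texts = ["I love spaCy", "nothing to see here", "spaCy again"]

# Stream the documents in batches instead of calling nlp() once per text
for doc in nlp.pipe(texts, batch_size=1000):
    print([(ent.text, ent.label_) for ent in doc.ents])
```

With hundreds of thousands of texts, tuning batch_size (and n_process for multiprocessing) in nlp.pipe() is usually the next knob to try.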
David Dale

Use nlp.pipe() to process multiple texts; it is faster and more efficient (see the documentation).
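
A minimal sketch of the difference, assuming `texts` is your list of chats (placeholder data here):

```python
import spacy

nlp = spacy.blank("en")  # stand-in for your pipeline with the EntityRuler added

texts = ["first chat", "second chat", "third chat"]  # placeholder data

# Slower: one full pipeline call per text
docs_slow = [nlp(text) for text in texts]

# Faster: nlp.pipe() batches the texts internally and yields Doc objects
docs_fast = list(nlp.pipe(texts))
```

Both loops produce the same Doc objects; nlp.pipe() simply amortizes per-call overhead across the batch.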

Oleg Ivanytskyi