
I have 300,000 news articles in my dataset and I'm using en_core_web_sm to do POS tagging, parsing, and NER extraction. However, this is taking hours and hours and never seems to finish.

The code works, but it is very slow: a sample of 650 articles takes 37 seconds, and 6,500 articles takes 6 minutes. However, I really need my full dataset of 300,000 articles to be processed in a reasonable amount of time...

texts = df["content"]
for doc in nlp.pipe(texts, n_process=2, batch_size=100, disable=["senter","attribute_ruler" "lemmatizer"]):
df["spacy_sm"] = texts

Is there a way to speed this up significantly, or am I missing something here?

1 Answer


On CPU you can usually use a much larger batch size, and you can run more processes depending on how many cores your CPU has. The default batch size for sm/md/lg models is currently 256, but you can increase it further as long as you aren't running out of RAM. A batch size of 1000 or higher would be totally reasonable on CPU for many tasks, but it really depends on the pipeline components and text lengths.
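For example, a variant of your call with a bigger batch and more workers might look like this (a sketch; n_process=8 and batch_size=1000 are illustrative values to adjust to your core count and RAM):

import spacy

# Components you don't need can also be disabled at load time.
nlp = spacy.load("en_core_web_sm", disable=["senter", "attribute_ruler", "lemmatizer"])

# batch_size=1000 and n_process=8 are assumed starting points, not fixed rules;
# tune them to your hardware and available memory.
docs = list(nlp.pipe(df["content"], n_process=8, batch_size=1000))
df["spacy_sm"] = docs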

As long as your text lengths don't vary too much, you can measure the approximate RAM required for a given batch size with a single process, then estimate how many processes fit in your available RAM at that batch size, as in the sketch below.
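A rough way to do that measurement (assuming the psutil package is installed; the sample size of 5,000 is arbitrary) is to run a representative sample through a single process at your target batch size and check the resident memory afterwards:

import psutil
import spacy

nlp = spacy.load("en_core_web_sm")

# Process a representative sample with one process at the target batch size.
for doc in nlp.pipe(df["content"].head(5000), batch_size=1000):
    pass

# Resident memory of this single worker, as a rough per-process estimate.
rss = psutil.Process().memory_info().rss
available = psutil.virtual_memory().available
print(f"~{rss / 1e9:.2f} GB per process; ~{available // rss} processes fit in free RAM")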
