I have 300,000 news articles in my dataset and I'm using en_core_web_sm for POS tagging, dependency parsing, and NER extraction. However, this is taking hours and hours and never seems to finish.
The code works, but it's very slow: a sample of 650 articles takes 37 seconds and 6,500 articles takes 6 minutes, which is roughly linear and would extrapolate to somewhere around 4.5 to 5 hours for all 300,000 articles. I really need the full dataset to be processed in a reasonable amount of time.
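For reference, this is roughly how I measured those samples (a minimal sketch; the head() slicing and the timer are just how I'd reproduce the numbers, not my exact harness):

import time
import spacy

nlp = spacy.load("en_core_web_sm")
sample = df["content"].head(650).tolist()
start = time.perf_counter()
docs = list(nlp.pipe(sample, n_process=2, batch_size=100,
                     disable=["senter", "attribute_ruler", "lemmatizer"]))
print(f"{len(docs)} docs in {time.perf_counter() - start:.1f}s")  # ~37s for 650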
import spacy

nlp = spacy.load("en_core_web_sm")
texts = df["content"]
# keep the processed Doc objects; the loop body previously overwrote the column with the raw texts,
# and the disable list was missing a comma between "attribute_ruler" and "lemmatizer"
df["spacy_sm"] = list(nlp.pipe(texts, n_process=2, batch_size=100,
                               disable=["senter", "attribute_ruler", "lemmatizer"]))
Is there a way to speed this up significantly, or am I missing something here?