I'm trying to apply the spaCy NLP (Natural Language Processing) pipeline to a big text file such as a Wikipedia dump. Here is my code, based on the example in spaCy's documentation:

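from spacy.en import English

# Read the whole file; "with" closes it and avoids shadowing the built-in input()
with open("big_file.txt") as f:
    big_text = f.read()

nlp = English()

# pipe() takes an iterable of texts and returns a generator of Doc objects
out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = next(out)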

spaCy applies all NLP operations, such as POS tagging and lemmatization, at once. It is a pipeline for NLP that takes care of everything you need in one step. The pipe method, though, is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see a big improvement in speed, and my CPU usage is around 25% (only one of 4 cores working). I also tried reading the file in multiple chunks and increasing the batch of input texts:

out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)

but the performance is still the same. Is there any way to speed up the process? I suspect that the OpenMP feature needs to be enabled when compiling spaCy in order to use multithreading, but there are no instructions on how to do this on Windows.
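
For reference, a chunked read of the kind described above could look like the following sketch. The helper name `iter_chunks`, the `lines_per_chunk` parameter, and the UTF-8 encoding are assumptions made for illustration; they are not taken from the original code. It yields the file as several smaller unicode strings instead of one huge string, so that pipe() receives many texts.

```python
import io
from itertools import islice


def iter_chunks(path, lines_per_chunk=1000, encoding="utf-8"):
    """Yield the file as unicode strings of up to `lines_per_chunk` lines each.

    Hypothetical helper: the name, chunk size and encoding are assumptions.
    """
    with io.open(path, encoding=encoding, errors="ignore") as f:
        while True:
            # Take the next group of lines without loading the whole file
            lines = list(islice(f, lines_per_chunk))
            if not lines:
                break
            yield u"".join(lines)


# Possible usage with the pipeline object from the question:
# for doc in nlp.pipe(iter_chunks("big_file.txt"), n_threads=-1):
#     ...
```

Splitting on line boundaries is only one option; any split that does not cut sentences in half should work equally well.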

Sajjad Bay

1 Answer

I figured out what the problem was. OpenMP is what spaCy uses to implement multithreading in the pipe() method, and this option is disabled by default for the MSVC compiler. After I compiled the source code with OpenMP support, it works great. I also made a pull request to enable this in upcoming releases. So for releases after 0.100.7 (which is the latest version at the time of writing), multithreading with pipe() should work on Windows with no issues.

Sajjad Bay
  • Hi SJ, any idea how to get the sentences out of pipe? I have a huge doc like you, as a single text string. I call nlp.pipe(txt) and was hoping I could simply use .sents on the result, but it doesn't work. How do you extract the NLP pieces from the pipe generator? – Baktaawar Mar 27 '18 at 22:50
  • The syntax has changed in the new version of spaCy. You first load a language model and then apply nlp to a text. "The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component." https://spacy.io/usage/processing-pipelines – Sajjad Bay Mar 30 '18 at 22:12
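
To illustrate the point raised in the comments, here is a minimal sketch of pulling sentences out of the pipe generator. The helper name `iter_sentences` is hypothetical, and the sketch assumes `nlp` is an already loaded spaCy pipeline whose Docs have sentence boundaries set (for example, a model that includes a parser). pipe() expects an iterable of texts and yields one Doc per text, so `.sents` has to be read from each Doc and not from the generator itself.

```python
def iter_sentences(nlp, texts):
    """Yield the text of every sentence in every Doc produced by nlp.pipe().

    Hypothetical helper; `nlp` is assumed to be a loaded spaCy pipeline
    and `texts` an iterable of unicode strings (not a single string).
    """
    for doc in nlp.pipe(texts):
        # Each Doc exposes its sentences as spans through .sents
        for sent in doc.sents:
            yield sent.text


# Possible usage (assumes spaCy and an English model are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# for sentence in iter_sentences(nlp, [big_text]):
#     print(sentence)
```

Note that a single string is wrapped in a list here; a bare string passed to pipe() is not treated as one document.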