I need to write up a proper blog post on this. The tl;dr is that spaCy is implemented in Cython, a Python-like language that transpiles into C or C++, and ultimately produces a Python extension. You can read more about releasing the GIL with Cython here:
http://docs.cython.org/src/userguide/parallelism.html
Here's the implementation of the .pipe method in spaCy:
https://github.com/spacy-io/spaCy/blob/master/spacy/syntax/parser.pyx#L135
def pipe(self, stream, int batch_size=1000, int n_threads=2):
    cdef Pool mem = Pool()
    cdef TokenC** doc_ptr = <TokenC**>mem.alloc(batch_size, sizeof(TokenC*))
    cdef int* lengths = <int*>mem.alloc(batch_size, sizeof(int))
    cdef Doc doc
    cdef int i
    cdef int nr_class = self.moves.n_moves
    cdef int nr_feat = self.model.nr_feat
    cdef int status
    queue = []
    for doc in stream:
        # Buffer raw C pointers to the documents, so the parallel loop
        # below never has to touch a Python object.
        doc_ptr[len(queue)] = doc.c
        lengths[len(queue)] = doc.length
        queue.append(doc)
        if len(queue) == batch_size:
            with nogil:
                for i in cython.parallel.prange(batch_size, num_threads=n_threads):
                    status = self.parseC(doc_ptr[i], lengths[i], nr_feat, nr_class)
                    if status != 0:
                        with gil:
                            sent_str = queue[i].text
                            raise ValueError("Error parsing doc: %s" % sent_str)
            PyErr_CheckSignals()
            for doc in queue:
                self.moves.finalize_doc(doc)
                yield doc
            queue = []
    # Parse whatever is left over in the final, partial batch.
    batch_size = len(queue)
    with nogil:
        for i in cython.parallel.prange(batch_size, num_threads=n_threads):
            status = self.parseC(doc_ptr[i], lengths[i], nr_feat, nr_class)
            if status != 0:
                with gil:
                    sent_str = queue[i].text
                    raise ValueError("Error parsing doc: %s" % sent_str)
    PyErr_CheckSignals()
    for doc in queue:
        self.moves.finalize_doc(doc)
        yield doc
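For context, calling this from Python looks something like the sketch below (assuming a spaCy 1.x-era install with the English model; the exact keyword arguments may differ between versions):

import spacy

nlp = spacy.load('en')
texts = [u'This is a sentence.'] * 10000

# Because the parser releases the GIL inside .pipe, the worker threads
# run genuinely in parallel instead of interleaving.
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
    pass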
The actual mechanics of the multi-threading are super simple, because NLP is (often) embarrassingly parallel --- every document is parsed independently, so all we need is a prange loop over each batch of documents.
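Stripped of the spaCy specifics, that loop can be tiny. Here's a minimal sketch of the pattern (all names are made up; process_one stands in for any C-level per-document function, and the module would need to be built with OpenMP, hence the distutils directives):

# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
from cython.parallel cimport prange

cdef int process_one(int* data, int length) nogil:
    # Stand-in for the real per-document work: pure C, no Python objects.
    cdef int j
    cdef int total = 0
    for j in range(length):
        total += data[j]
    return total

def process_batch(int[:, ::1] batch, int n_threads=2):
    cdef int i
    cdef int n = batch.shape[0]
    cdef int width = batch.shape[1]
    with nogil:
        for i in prange(n, num_threads=n_threads):
            # Every row is independent, so there's nothing to lock.
            process_one(&batch[i, 0], width)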
Implementing the parser in a multi-threaded way was quite hard, though. To use multi-threading effectively, you need to release the GIL, and not reacquire it. This means making no use of Python objects, not raising exceptions, etc.
When you create a Python object --- let's say a list --- you need to increment its reference count, and reference counting is only safe while you hold the global lock. This means acquiring the GIL. There's no way around that. But if you're in a C extension and you just want to, say, put an integer on the stack, or make a call to malloc or free, you don't need the GIL. So if you write the program at that level, using only C and C++ constructs, you can release the GIL.
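Concretely, here's the kind of function you can run with the GIL released (a contrived sketch, nothing to do with spaCy's internals):

from libc.stdlib cimport malloc, free

cdef int sum_squares(int n) nogil:
    # All fine without the GIL: stack ints, malloc/free, plain arithmetic.
    cdef int* buf = <int*>malloc(n * sizeof(int))
    cdef int i
    cdef int total = 0
    if buf == NULL:
        return -1
    for i in range(n):
        buf[i] = i * i
        total += buf[i]
    free(buf)
    # Something like buf_list = [1, 2, 3] here would be a compile error:
    # building the list means refcounting, which means the GIL.
    return total

def sum_squares_py(int n):
    cdef int result
    with nogil:
        result = sum_squares(n)
    # Boxing the result into a Python int happens here, with the GIL held.
    return result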
I've been writing statistical parsers in Cython for a few years now. (Before spaCy I had an implementation for my academic research.) Getting the entire parsing loop written without the GIL was hard. By late 2015 I had the machine learning, hash table, outer parsing loop, and most of the feature extraction as nogil code. But the state object had a complicated interface, and was implemented as a cdef class. I couldn't create this object or store it in a container without acquiring the GIL.
The breakthrough came when I figured out an undocumented way to write a C++ class in Cython. This allowed me to hollow out the existing cdef class that controlled the parser state. I proxied its interface to the inner C++ class, method by method. This way I could keep the code working, and make sure I didn't introduce any subtle bugs into the feature calculation.
You can see the inner class here: https://github.com/spacy-io/spaCy/blob/master/spacy/syntax/_state.pxd
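The construct looks roughly like this (a stripped-down sketch with made-up names, not spaCy's actual StateC): you declare a cppclass and give its methods bodies, and Cython emits a plain C++ class. Instances aren't Python objects, so they can be created, used, and freed entirely inside nogil code.

# distutils: language = c++
from libc.stdlib cimport calloc, free

cdef cppclass ParseStateC:
    # A real C++ class: no Python object header, no reference count.
    int* stack
    int capacity
    int top

    __init__(int capacity) nogil:
        this.stack = <int*>calloc(capacity, sizeof(int))
        this.capacity = capacity
        this.top = 0

    __dealloc__():
        free(this.stack)

    void push(int value) nogil:
        if this.top < this.capacity:
            this.stack[this.top] = value
            this.top += 1

cdef void use_state(int n) nogil:
    # new and del work without the GIL, unlike instantiating a cdef class.
    cdef ParseStateC* state = new ParseStateC(n)
    state.push(1)
    del state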
If you dig through the git history of that file, you can see the patches where I implemented the .pipe method.