
I want to use the multiprocessing module to run phrase matching on documents in parallel. My idea was to create a PhraseMatcher object in one process and then share it among multiple processes by making copies of the PhraseMatcher object. The code fails silently, without giving any kind of error. To make things easier, here is a minimal example of what I am trying to achieve:

import copy
import spacy
from spacy.matcher import PhraseMatcher


nlp = spacy.load('en')
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]

matcher = PhraseMatcher(nlp.vocab)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)


matcher2 = copy.deepcopy(matcher)

doc = nlp("yellow fabric")
matches = matcher2(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'COLOR'
    span = doc[start : end]  # get the matched slice of the doc
    print(rule_id, span.text)

With the matcher2 object I get no output, but with the original matcher object I get the expected results:

COLOR yellow
MATERIAL yellow fabric

I have been stuck on this for a couple of days. Any help will be deeply appreciated.

Thank you.

Anurag Sharma

1 Answer

The root of your problem is that PhraseMatcher is a Cython class, defined and implemented in the file matcher.pyx, and Cython does not work properly with deepcopy.

Referenced from the accepted answer to this StackOverflow question:

Cython doesn't like deepcopy on classes which have function/method referenced variables. Those variable copies will fail.

However, there are alternatives. If you want to run the PhraseMatcher on multiple documents in parallel, you can use multithreading via the pipe method of PhraseMatcher.

A possible workaround for your problem:

import spacy
from spacy.matcher import PhraseMatcher


nlp = spacy.load('en_core_web_sm')
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]

matcher = PhraseMatcher(nlp.vocab)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)

doc1 = nlp('yellow fabric')
doc2 = nlp('red lipstick and big black boots')

# pipe() streams the docs through the matcher; n_threads controls the
# number of worker threads (note: deprecated in later spaCy versions).
for doc in matcher.pipe([doc1, doc2], n_threads=4):
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(rule_id, span.text)
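If you do need separate processes rather than threads, a common pattern for non-picklable objects like the matcher is to rebuild them inside each worker via a Pool initializer instead of copying them across process boundaries. Below is a minimal, runnable sketch of that pattern using a placeholder regex-based matcher in place of spaCy (so it needs no model download); `PATTERNS`, `init_worker`, and `match` are illustrative names, not spaCy API:

```python
import re
from multiprocessing import Pool

# Placeholder patterns standing in for the spaCy phrase patterns above.
PATTERNS = {
    'COLOR': ['red', 'green', 'yellow'],
    'MATERIAL': ['silk', 'yellow fabric'],
}

_matcher = None  # per-process matcher, rebuilt in each worker


def init_worker():
    # Rebuild the (hypothetically non-picklable) matcher inside each
    # worker process instead of copying it from the parent.
    global _matcher
    _matcher = {
        label: re.compile('|'.join(map(re.escape, phrases)))
        for label, phrases in PATTERNS.items()
    }


def match(text):
    # Return (label, matched text) pairs for every pattern hit.
    return [(label, m.group())
            for label, rx in _matcher.items()
            for m in rx.finditer(text)]


if __name__ == '__main__':
    with Pool(2, initializer=init_worker) as pool:
        for result in pool.map(match, ['yellow fabric', 'red boots']):
            print(result)
```

With real spaCy objects, `init_worker` would load the model and rebuild the PhraseMatcher from the shared pattern strings; only the plain strings, which pickle cleanly, cross the process boundary.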

Hope it helps!

gdaras
    The keyword argument n_threads on the .pipe methods is now deprecated, as the v2.x models cannot release the global interpreter lock. (Future versions may introduce a n_process argument for parallel inference via multiprocessing.) [Source](https://spacy.io/usage/v2-1) – Andrew Quaschnick Aug 12 '20 at 20:11