I am trying to build a Dataflow pipeline, and it works fine without spaCy. After I introduce spaCy, it starts failing with the error below:
return _create_pardo_operation(
File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1589, in _create_pardo_operation
dofn_data = pickler.loads(serialized_fn)
File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/pickler.py", line 289, in loads
return dill.loads(s)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 577, in _load_type
return _reverse_typemap[name]
KeyError: 'ClassType'
ParDo code here:
@beam.typehints.with_input_types(PubsubMessage)
@beam.typehints.with_output_types(beam.pvalue.TaggedOutput)
class PayloadOutput(beam.DoFn):
    def process(self, element):
        yield beam.pvalue.TaggedOutput(element.attributes['payload'], element)

splitme = (pipeline
           | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
               subscription=input_subscription,
               with_attributes=True)
           | 'Split Payload' >> beam.ParDo(PayloadOutput()).with_outputs('message', 'rtbf')
           )
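The two tagged outputs are then consumed separately further down the pipeline. For reference, the results of .with_outputs() are accessed in the standard Beam way, roughly like this (illustration only, not my exact downstream code):

messages = splitme.message  # elements tagged 'message' by PayloadOutput
rtbf = splitme.rtbf         # elements tagged 'rtbf' by PayloadOutput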
Code using spaCy:
import spacy

def remove_PII(message, language_code, found_product_names):
    """De-identify text by masking PII such as people's names, email addresses
    and phone/credit card numbers."""
    # Mask people's names
    lang = language_code[:2].lower()  # Get language
    # Dictionary of spaCy models for different languages
    spacy_keys = {'en': 'en_core_web_sm', 'fr': 'fr_core_news_sm', 'nl': 'nl_core_news_sm',
                  'da': 'da_core_news_sm', 'pt': 'pt_core_news_sm', 'es': 'es_core_news_sm'}
    nlp = spacy.load(spacy_keys[lang])  # Load the spaCy model
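The rest of the function (omitted here) masks the names the model detects. A typical spaCy pattern for that step looks roughly like this (a simplified sketch, not my exact code; English models label people as 'PERSON', some of the non-English models use 'PER'):

doc = nlp(message)
for ent in doc.ents:
    if ent.label_ in ('PERSON', 'PER'):  # person entities detected by the model
        message = message.replace(ent.text, '[NAME]')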
I tried looking for related issues and found a GitHub bug report, but I don't know how to fix this one.