
I am trying to build a Dataflow pipeline and it works fine without spaCy. After I introduce spaCy it starts failing with the error below:

    return _create_pardo_operation(
  File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1589, in _create_pardo_operation
    dofn_data = pickler.loads(serialized_fn)
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/pickler.py", line 289, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'

ParDo code here:


@beam.typehints.with_input_types(PubsubMessage)
@beam.typehints.with_output_types(beam.pvalue.TaggedOutput)
class PayloadOutput(beam.DoFn):


    def process(self, element):
        yield beam.pvalue.TaggedOutput(element.attributes['payload'], element)

splitme = (pipeline
            | "Read from Pub/Sub"
            >> beam.io.ReadFromPubSub(
                subscription=input_subscription,
                with_attributes=True
            )
            | 'Split Payload' >> beam.ParDo(PayloadOutput()).with_outputs('message', 'rtbf'))

Code using spacy:


def remove_PII(message, language_code, found_product_names):
    """De-identify text by masking PII such as people's names, email addresses and phone/credit card numbers."""

    # Mask people's names
    lang = language_code[:2].lower()  # Get language
    # Dictionary of spaCy models for different languages
    spacy_keys = {'en': 'en_core_web_sm', 'fr': 'fr_core_news_sm', 'nl': 'nl_core_news_sm',
                  'da': 'da_core_news_sm', 'pt': 'pt_core_news_sm', 'es': 'es_core_news_sm'}

    nlp = spacy.load(spacy_keys[lang])  # load the spaCy model

I looked for related issues and found a GitHub bug report, but I don't know how to fix this one:

https://github.com/uqfoundation/dill/issues/217

ankie
  • What spaCy version is this? – polm23 Oct 22 '21 at 09:20
  • I suspect some of the dependencies you are using are not available to Dataflow at runtime. Please see here for more information in including dependencies for a Dataflow pipeline: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ – chamikara Oct 25 '21 at 13:37

1 Answer


You need to `import dill` in both file A and file B.

In file B, before deserializing, invoke the following line of code:

dill._dill._reverse_typemap['ClassType'] = type

This maps the Python 2-only name ClassType back to type, which is exactly the name the KeyError is complaining about. Put together:

File A
import dill

File B

import dill
dill._dill._reverse_typemap['ClassType'] = type

# do deserialization
obj = dill.loads(some_serialized_string)
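The patch above works by remapping a type name that no longer exists in Python 3 to a present-day object before unpickling. The same idea can be sketched with the standard-library pickle module alone (the class `LegacyRemapUnpickler` and the hand-written byte string below are mine, for illustration, not part of dill):

```python
import io
import pickle

# Sketch: remap a type name that no longer exists (Python 2's ClassType)
# to a present-day object before unpickling -- the same trick as patching
# dill._dill._reverse_typemap['ClassType'] = type.
class LegacyRemapUnpickler(pickle.Unpickler):
    REMAP = {('types', 'ClassType'): type}  # old (module, name) -> new object

    def find_class(self, module, name):
        if (module, name) in self.REMAP:
            return self.REMAP[(module, name)]
        return super().find_class(module, name)

def loads_with_remap(blob):
    return LegacyRemapUnpickler(io.BytesIO(blob)).load()

# A hand-written protocol-0 stream referencing types.ClassType: plain
# pickle.loads raises AttributeError on it, the remapping loader resolves it.
legacy_blob = b'ctypes\nClassType\n.'
```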

An alternative workaround is to avoid references to __main__ in the pickle altogether, as in this example:

import sys
import os
import pickle

def pickle_dumps_without_main_refs(obj):
    """
    A hack, but it allows you to pickle an object defined in the main module
    so that it can be reloaded from another module.
    :param obj: Any picklable object.
    :return: The pickle bytes, with __main__ replaced by the real module path.
    """
    currently_run_file = sys.argv[0]
    module_path = file_path_to_absolute_module(currently_run_file)
    pickle_str = pickle.dumps(obj, protocol=0)
    pickle_str = pickle_str.replace(b'__main__', module_path.encode())  # Hack!
    return pickle_str


def pickle_dump_without_main_refs(obj, file_obj):
    string = pickle_dumps_without_main_refs(obj)
    file_obj.write(string)


def file_path_to_absolute_module(file_path):
    """
    Given a file path, return an import path.
    :param file_path: A file path to a .py or .pyc file.
    :return: The dotted import path, e.g. 'package.module'.
    """
    assert os.path.exists(file_path)
    file_loc, ext = os.path.splitext(file_path)
    assert ext in ('.py', '.pyc')
    directory, module = os.path.split(file_loc)
    module_path = [module]
    while True:
        if os.path.exists(os.path.join(directory, '__init__.py')):
            directory, package = os.path.split(directory)
            module_path.append(package)
        else:
            break
    path = '.'.join(module_path[::-1])
    return path
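One detail worth flagging in pickle_dumps_without_main_refs: on Python 3, pickle.dumps returns bytes, so the __main__ substitution has to operate on bytes, not str. A minimal standalone sketch of just that step (the function name rename_main_refs is mine):

```python
def rename_main_refs(blob: bytes, module_path: str) -> bytes:
    """Replace references to __main__ with a real import path (the hack above)."""
    # Protocol 0 streams are ASCII, so module names appear as raw bytes.
    return blob.replace(b'__main__', module_path.encode('ascii'))

# A protocol-0 GLOBAL opcode for __main__.my_func, rewritten to pkg.mod.my_func:
patched = rename_main_refs(b'c__main__\nmy_func\n.', 'pkg.mod')
```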

Now, I can simply change dill_pickle_script_1.py to say

import time
from artemis.remote.child_processes import pickle_dump_without_main_refs

def my_func(a, b):
    time.sleep(0.1)
    return a+b

if __name__ == '__main__':
    with open('testfile.pkl', 'wb') as f:
        pickle_dump_without_main_refs(my_func, f)

And then dill_pickle_script_2.py can load testfile.pkl without any changes.
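For the original Dataflow question specifically, a common way to sidestep pickling errors like this is to keep the spaCy model out of the pickled DoFn entirely and load it lazily on the worker (DoFn.setup() is the usual Beam hook for that). Below is a stdlib-only sketch of the caching pattern, with spacy.load stubbed out behind an injected load_model callable; PIIMasker and load_model are my names for illustration, not Beam or spaCy API:

```python
# Sketch: load heavy NLP models lazily, once per worker, so they are never
# serialized with the pipeline. In a real Beam DoFn this belongs in setup().
SPACY_MODELS = {'en': 'en_core_web_sm', 'fr': 'fr_core_news_sm',
                'nl': 'nl_core_news_sm', 'da': 'da_core_news_sm',
                'pt': 'pt_core_news_sm', 'es': 'es_core_news_sm'}

class PIIMasker:
    def __init__(self, load_model):
        self._load_model = load_model  # e.g. spacy.load; injected for testing
        self._cache = {}               # filled on the worker, never pickled

    def nlp_for(self, language_code):
        lang = language_code[:2].lower()
        if lang not in self._cache:    # each model is loaded at most once
            self._cache[lang] = self._load_model(SPACY_MODELS[lang])
        return self._cache[lang]

# Usage with a stub loader: the model is loaded on first use only.
calls = []
masker = PIIMasker(lambda name: calls.append(name) or f'<model {name}>')
masker.nlp_for('en-US')
masker.nlp_for('en-GB')
```

Because only model names (strings) live on the object before first use, there is nothing heavy or unpicklable for Beam to serialize when it ships the DoFn to workers.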

Raul Saucedo