
I was trying to create a composite PTransform as follows (Python):

class LimitVolume(beam.PTransform):
    def __init__(self, daily_window, daily_limit):
        super().__init__()
        self.daily_window = daily_window
        self.daily_limit = daily_limit

    def expand(self, input_events_pcoll):
        events_with_ts_pcol = (input_events_pcoll
            | 'Timestamp using RECEIVED_TIMESTAMP' >> beam.Map(
                lambda message: beam.window.TimestampedValue(message, message['RECEIVED_TIMESTAMP']))
            )
        ...
        return events_with_ts_pcol

and then use it in the main run() method as follows:

def run():
    ...
    result_pcol = input_pcol | LimitVolume(daily_window, daily_limit)

Both run() and LimitVolume are defined in the same main.py script, which is then submitted/deployed as a job to GCP.

When I run this job locally via DirectRunner, everything works fine. If I submit and run it using DataflowRunner in GCP, it starts throwing errors like:

in process NameError: name 'arrow' is not defined [while running 'Parse Json-ptransform-898945'] 
and in <lambda> NameError: name 'time' is not defined [while running 'define schedule-ptransform-899107']

Basically, it fails to find many dependencies that are all defined in the requirements.txt file and specified via the --requirements_file option when deploying the job.

See the error stacktrace (abbreviated) below.

Now, the punch line:

If I put the same logic from the LimitVolume PTransform directly into the run() method's pipeline:

def run():
    ...
    events_with_ts_pcol = (input_pcol
        | 'Timestamp using RECEIVED_TIMESTAMP' >> beam.Map(
            lambda message: beam.window.TimestampedValue(message, message['RECEIVED_TIMESTAMP']))
        )
    ...

and remove the definition of the LimitVolume class from the main.py file, it WORKS fine both locally and in GCP! No issues with dependencies.

So, clearly there is something very "special" about the mere existence of a custom PTransform in the pipeline. Does anybody know what that might be?

I could not find any information on custom PTransforms, the specifics of packaging with them, or errors like this, which in itself is worrisome...

Thank you!!

Here is a larger excerpt of the error output:

 File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs File "/Users/mpopova/Marina/TT_Projects/gcp_inboundconverter/ibc_ingest_dataflow/src/main.py", line 45, in process NameError: name 'arrow' is not defined During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 284, in _execute response = task() File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 357, in <lambda> lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 602, in do_instruction getattr(request, request_type), request.instruction_id) File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 639, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 997, in process_bundle element.data) File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 222, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 351, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 353, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs File "/Users/mpopova/Marina/TT_Projects/gcp_inboundconverter/ibc_ingest_dataflow/src/main.py", line 45, in process NameError: name 'arrow' is not defined [while running 'Parse Json-ptransform-898945'] passed through: ==> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:771 
...
line 1234, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1299, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1395, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 712, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 713, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1234, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/Users/mpopova/Marina/TT_Projects/gcp_inboundconverter/ibc_ingest_dataflow/venv/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1562, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "/Users/mpopova/Marina/TT_Projects/gcp_inboundconverter/ibc_ingest_dataflow/src/main.py", line 273, in <lambda>
NameError: name 'time' is not defined [while running 'define schedule-ptransform-899107']
passed through: ==> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:771
Marina

2 Answers


This sounds like issues with capturing/pickling objects in the __main__ session. You could try passing the save_main_session flag. Personally, I prefer the solution of putting all required objects (imports, definitions, etc.) that are in the main module inside the run() method, to be sure they're captured correctly.
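For example, here is a minimal sketch of that pattern, based on the code from the question (the names are only illustrative; the point is that the imports and the transform definition become locals of run() rather than __main__ globals):

import apache_beam as beam

def run(argv=None):
    # Modules the DoFns/lambdas need are imported here, inside run(),
    # instead of at the top of main.py.
    import time
    import arrow

    # The custom transform is defined here too, so it is a local of run()
    # rather than a global of the __main__ module. The explicit base-class
    # call (instead of super().__init__()) sidesteps the dill/super()
    # pickling issue discussed in the other answer below.
    class LimitVolume(beam.PTransform):
        def __init__(self, daily_window, daily_limit):
            beam.PTransform.__init__(self)
            self.daily_window = daily_window
            self.daily_limit = daily_limit

        def expand(self, pcoll):
            return (pcoll
                | 'Timestamp using RECEIVED_TIMESTAMP' >> beam.Map(
                    lambda message: beam.window.TimestampedValue(
                        message, message['RECEIVED_TIMESTAMP'])))

    ...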

Note also that there is an effort to migrate to cloudpickle to avoid these limitations.
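If you want to experiment with that, newer Beam SDK releases expose an opt-in pickle_library pipeline option; a hedged sketch, assuming your SDK version (roughly 2.36.0 or later) supports it:

from apache_beam.options.pipeline_options import PipelineOptions

# Same options as in the question, plus an opt-in to cloudpickle instead of
# the default dill-based pickler. The pickle_library option may not exist in
# older Beam releases, so verify it against your SDK version.
options = PipelineOptions(
    beam_args,
    pickle_library='cloudpickle',
    save_main_session=True,
    streaming=True,
)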

robertwb
  • thanks, @robertwb! I already set the save_main_session flag to True when creating the pipeline: options = PipelineOptions(beam_args, save_main_session=True, streaming=True) – Marina Nov 11 '21 at 21:14
  • save_main_session is a hack that doesn't solve all issues; I would put things inside your run() method (so they're locals, not `__main__` globals) or in a separate module (that can be properly referenced by name). – robertwb Nov 11 '21 at 21:15
  • As for putting everything into run(): yeah, sure, that's what I have to do now, but it makes it impossible to properly unit test the individual transforms. The whole idea of having a custom PTransform was to be able to unit-test it with all kinds of scenarios... having it all as one monolith in run() makes it much, much harder – Marina Nov 11 '21 at 21:16
  • If it's something intended to be reused, you could put it in its own module and that should work fine as well. (I realize this is not ideal, hopefully cloudpickle will be the right long-term solution.) – robertwb Nov 11 '21 at 23:04
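A hedged sketch of the separate-module layout suggested in the comment above, using a hypothetical transforms.py placed next to main.py (the module also has to reach the workers, e.g. via a setup.py package and the --setup_file option, since --requirements_file only covers PyPI dependencies):

# transforms.py -- importable by name on the workers
import apache_beam as beam

class LimitVolume(beam.PTransform):
    def __init__(self, daily_window, daily_limit):
        beam.PTransform.__init__(self)
        self.daily_window = daily_window
        self.daily_limit = daily_limit

    def expand(self, pcoll):
        return (pcoll
            | 'Timestamp using RECEIVED_TIMESTAMP' >> beam.Map(
                lambda message: beam.window.TimestampedValue(
                    message, message['RECEIVED_TIMESTAMP'])))

# main.py
from transforms import LimitVolume

def run():
    ...
    result_pcol = input_pcol | LimitVolume(daily_window, daily_limit)

This also keeps the transform importable from unit tests, which addresses the testability concern raised in the comments.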

There is a known issue [1] in Apache Beam:

Using --save_main_session fails on Python 3 when main module has invocations of superclass method using 'super'.

This issue has been open for a couple of years; it hasn't been fixed because it stems from a Beam dependency called dill. The corresponding dill issue can be found on GitHub [2].

As mentioned in one of the comments on the GitHub issue [3], the workaround is:

For a class declared in the main module, for example:

class MyClass(SuperClass):

instead of calling the parent class's __init__ via super().__init__(), the explicit base class name should be used:

SuperClass.__init__(self)

In your code, the change should be:

class LimitVolume(beam.PTransform):
    def __init__(self, daily_window, daily_limit):
        beam.PTransform.__init__(self)
        ...

Meanwhile, the error NameError: name 'time' is not defined can also be related to another issue with the Apache Beam Python SDK's dependency-import mechanism. As mentioned by @robertwb, if the issue happens in the __main__ session, you can set the --save_main_session pipeline option to True.

However, if the error happens outside of it, you can solve it by importing the module locally, where it is used (credit to the Google Dataflow documentation [4]).

For example, instead of:

import re
…
def myfunc():
  # use re module

Use:

def myfunc():
  import re
  # use re module

[1] https://issues.apache.org/jira/browse/BEAM-6158

[2] https://github.com/uqfoundation/dill/issues/300

[3] https://github.com/uqfoundation/dill/issues/300#issuecomment-505011149

[4] https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors

derrickqin