
My Dataflow pipeline with runtime arguments runs fine with the DirectRunner, but hits an argparse ArgumentError when I switch to the DataflowRunner.

  File "/home/user/miniconda3/lib/python3.8/site-packages/apache_beam/options/pipeline_options.py", line 124, in add_value_provider_argument
    self.add_argument(*args, **kwargs)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1386, in add_argument
    return self._add_action(action)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1749, in _add_action
    self._optionals._add_action(action)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1590, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1400, in _add_action
    self._check_conflict(action)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1539, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1548, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --bucket_input: conflicting option string: --bucket_input

Here is how the argument is defined and used:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class CustomPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Register --bucket_input as a runtime (ValueProvider) argument.
        parser.add_value_provider_argument(
            '--bucket_input',
            default="device-file-dev",
            help='Raw device file bucket')


pipeline_options = PipelineOptions()  # built from the command-line args in the real script
pipeline = beam.Pipeline(options=pipeline_options)

custom_options = pipeline_options.view_as(CustomPipelineOptions)

# CreateGcsPCol is a custom DoFn defined elsewhere in the script.
_ = (
    pipeline
    | 'Initiate dataflow' >> beam.Create(["Start"])
    | 'Create P collection with file paths' >> beam.ParDo(
        CreateGcsPCol(input_bucket=custom_options.bucket_input))
)
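
For context, CreateGcsPCol receives the ValueProvider and resolves it at run time; a simplified sketch (not the exact implementation) of a DoFn like it could look like this:

import apache_beam as beam


class CreateGcsPCol(beam.DoFn):
    # Simplified sketch only: the real DoFn emits the file paths in the bucket.
    def __init__(self, input_bucket):
        # input_bucket is a ValueProvider; its value is only known at run time,
        # so store the provider itself and defer .get() to process().
        self.input_bucket = input_bucket

    def process(self, element):
        bucket = self.input_bucket.get()  # resolve the runtime argument here
        # ... list gs://<bucket>/ and yield the file paths ...
        yield 'gs://%s/' % bucket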

Note that this only happens with the DataflowRunner. Does anyone know how to solve it? Thanks a lot.

Yun Chen
  • The error is telling you that the `parser` already has an argument with the `--bucket_input` flag. You shouldn't be trying to add it again. – hpaulj Jan 28 '21 at 21:22
  • Thanks @hpaulj. However, it runs well with the DirectRunner -- all the arguments are parsed and used correctly there. Any idea why it fails only with the DataflowRunner? – Yun Chen Jan 29 '21 at 07:50
  • Your code looks correct, so I'm thinking the problem is elsewhere. How are you executing the script? Notebook? Command line? Something else? – Cubez Jan 29 '21 at 20:26
  • @Cubez you are right. The error is caused by importing a local python submodule via a relative path. With DirectRunner, the relative path works but not with DataflowRunner. The problem was solved by installing both the dataflow pipeline module and the submodule, and importing from the installed submodule instead of using the relative path. – Yun Chen Feb 01 '21 at 10:29
  • @YunChen, could you elaborate on your solution? What do you mean by installing both the dataflow pipeline module and the submodule and importing from the submodule instead of using the relative path? Are you using the --setup_file argument to pass something to the script? Any advice would be great! – antti Apr 29 '21 at 11:16
  • @antti, yes, I am using --setup_file and install both pip and local packages in setup.py. Here is the folder structure together with setup.py: ` |_pipeline/ |_pipeline/submodule |_setup.py ` and here is the setup.py content: ` import setuptools REQUIRED_PACKAGES = [ pip packages go here ] setuptools.setup( name="xxx", version="0.0.1", description="xxxx", install_requires=REQUIRED_PACKAGES, package_dir={'main_module': 'pipeline', 'sub_module': 'pipeline/submodule'}, packages=["main_module", "sub_module"] ) ` – Yun Chen May 05 '21 at 11:01
  • Thanks @YunChen! And did your main_module and sub_module both define the --bucket_input argument? In my case I had two separate Dataflow scripts A and B with separate pipeline arguments, some of which had the same names. I had some infrastructure code that, based on a variable value, ran either pipeline A or pipeline B. This didn't work with the DataflowRunner because the scripts had conflicting argument names, but there was no issue with the DirectRunner. In the end I had to combine A and B into a single script with the arguments defined only once, but with two "run" methods. – antti May 06 '21 at 12:25
  • @antti I only defined the argument in the main_module. main_module is the Dataflow module and sub_module is a local Python module that does not involve Dataflow. I have actually already switched from Dataflow to Cloud Functions since Google raised the Cloud Functions memory limit to 8 GB, because Dataflow raised some runtime errors which I do not encounter in the Cloud Functions setting. – Yun Chen May 06 '21 at 13:27

1 Answer


Copying the answer from the comment here:

The error is caused by importing a local Python sub-module via a relative path. With the DirectRunner, the relative path works because it's on the local machine. However, the DataflowRunner runs on a different machine (a GCE instance) and needs the absolute path. Thus, the problem was solved by installing both the Dataflow pipeline module and the sub-module, and importing from the installed sub-module instead of using the relative path.
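
For reference, here is a sketch of the packaging fix described in the question's comments: a setup.py next to the pipeline/ folder that installs both the pipeline module and the sub-module so the workers can import them without relative paths (the package names, versions, and folder layout are the asker's placeholders):

# Folder layout (from the asker's comment):
#   pipeline/
#   pipeline/submodule/
#   setup.py
import setuptools

REQUIRED_PACKAGES = [
    # pip dependencies go here
]

setuptools.setup(
    name="xxx",
    version="0.0.1",
    description="xxxx",
    install_requires=REQUIRED_PACKAGES,
    package_dir={'main_module': 'pipeline', 'sub_module': 'pipeline/submodule'},
    packages=["main_module", "sub_module"],
)

The pipeline is then launched with --setup_file=./setup.py so the DataflowRunner builds and installs this package on the workers, and the code imports sub_module (an absolute import) instead of using a relative path.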

Cubez