
I am building a Google Cloud Dataflow pipeline to process videos. I am having a very hard time debugging the pipeline because the behavior seems different on DirectRunner versus DataflowRunner.

My video processing tool (called DeepMeerkat below) takes in arguments from argparse. I call the pipeline:

python run_clouddataflow.py \
    --runner DataFlowRunner \
    --project $PROJECT \
    --staging_location $BUCKET/staging \
    --temp_location $BUCKET/temp \
    --job_name $PROJECT-deepmeerkat \
    --setup_file ./setup.py \
    --maxNumWorkers 3 \
    --tensorflow \
    --training

The last two arguments, tensorflow and training, are for my pipeline; the rest are needed by Cloud Dataflow.

I parse the args and pass the leftover argv to the pipeline:

beam.Pipeline(argv=pipeline_args)

and then, within DeepMeerkat's argparse, parse just the known args:

args, _ = parser.parse_known_args()

This works perfectly locally: tensorflow is turned off (default is on) and training is turned on (default is off). Printing args confirms the behavior. But the flags fail to parse on Cloud Dataflow: tensorflow stays on and training stays off.

DirectRunner:

DeepMeerkat args: Namespace(tensorflow=False, training=True)

From the logs of DataflowRunner:

DeepMeerkat args: Namespace(tensorflow=True, training=False)
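
For reference, here's a minimal sketch of the flag handling. The store_false/store_true actions and defaults are reconstructed from the Namespace output above, so treat them as assumptions rather than the exact DeepMeerkat code:

import argparse

import apache_beam as beam

parser = argparse.ArgumentParser()
# --tensorflow switches TensorFlow OFF (it defaults to on); --training
# switches training ON (it defaults to off). Actions inferred from the
# Namespace values logged above.
parser.add_argument('--tensorflow', action='store_false', default=True)
parser.add_argument('--training', action='store_true', default=False)

# Driver side: keep DeepMeerkat's flags, hand everything else to Beam.
args, pipeline_args = parser.parse_known_args()
print('DeepMeerkat args:', args)

pipeline = beam.Pipeline(argv=pipeline_args)

# Later, inside DeepMeerkat (potentially on a remote worker), the parse
# is repeated with no explicit argv, i.e. against that process's
# sys.argv -- which is where the two runners diverge:
# args, _ = parser.parse_known_args()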

Any ideas what's going on here? Identical commands, identical code; the only change is DirectRunner to DataflowRunner.

I'd rather not go down the road of passing custom arguments through pipeline options, since I would then need to assign them somewhere downstream. If a tool already parses its own arguments, reusing that parser seems like the much more straightforward solution, provided there isn't something special about the Dataflow workers.


1 Answer


I had the wrong conceptual model for this. Locally, each "worker" still has access to sys.argv, so it was not that the runner behavior was different; rather, the local "worker" was circumventing the cloud pipeline and grabbing fresh args to parse. On Dataflow the workers never see the original command line, so argparse falls back to the defaults. The way to do this on DataflowRunner is to explicitly pass the args to your DoFn via __init__(self, args), and then parse those args inside the Beam pipeline as if they came from a list of strings.
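
Here's a minimal sketch of that pattern. The ProcessVideo DoFn and its flags are placeholders reconstructed from the question, not the actual DeepMeerkat code; the point is that the list handed to __init__ gets pickled with the DoFn, so every worker parses the same flags instead of its own sys.argv:

import argparse
import sys

import apache_beam as beam


def make_parser():
    # Same reconstructed DeepMeerkat flags as in the question sketch.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('--tensorflow', action='store_false', default=True)
    parser.add_argument('--training', action='store_true', default=False)
    return parser


class ProcessVideo(beam.DoFn):  # hypothetical DoFn name
    def __init__(self, argv):
        # Captured on the driver and pickled along with the DoFn, so
        # every Dataflow worker sees the same flags -- unlike sys.argv.
        self.argv = argv

    def process(self, element):
        # Re-parse from the shipped list, not the worker's sys.argv.
        args, _ = make_parser().parse_known_args(self.argv)
        # ... run DeepMeerkat on element according to args ...
        yield element


if __name__ == '__main__':
    # Split once on the driver: DeepMeerkat flags vs. pipeline flags.
    _, pipeline_args = make_parser().parse_known_args()
    with beam.Pipeline(argv=pipeline_args) as p:
        (p
         | beam.Create(['gs://bucket/video.mp4'])  # placeholder input
         | beam.ParDo(ProcessVideo(argv=sys.argv[1:])))

Passing sys.argv[1:] wholesale is fine here, because parse_known_args on the worker simply ignores the pipeline-level flags mixed into the list.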
