I am building a Google Cloud Dataflow pipeline to process videos. I am having a very hard time debugging it because the behavior seems to differ between DirectRunner and DataflowRunner.
My video processing tool (called DeepMeerkat below) takes in arguments from argparse. I call the pipeline:
python run_clouddataflow.py \
--runner DataflowRunner \
--project $PROJECT \
--staging_location $BUCKET/staging \
--temp_location $BUCKET/temp \
--job_name $PROJECT-deepmeerkat \
--setup_file ./setup.py \
--max_num_workers 3 \
--tensorflow \
--training
The last two arguments, --tensorflow and --training, are for my pipeline; the rest are needed by Cloud Dataflow.
I parse the arguments and pass the remaining argv to the pipeline:

beam.Pipeline(argv=pipeline_args)

and then, within DeepMeerkat's argparse, parse just the known args:

args, _ = parser.parse_known_args()
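
For reference, a stripped-down sketch of how the pieces fit together (the flag actions here are my reconstruction from the Namespace output below, and the video list and run_deepmeerkat are stand-ins for my actual code):

import argparse
import sys

import apache_beam as beam

def run_deepmeerkat(video_path):
    # DeepMeerkat builds its own parser; with no explicit argv,
    # parse_known_args() reads sys.argv of whatever process runs this.
    parser = argparse.ArgumentParser()
    parser.add_argument('--tensorflow', action='store_false', default=True)  # flag turns TensorFlow off
    parser.add_argument('--training', action='store_true', default=False)    # flag turns training on
    args, _ = parser.parse_known_args()
    print('DeepMeerkat args:', args)
    # ... process video_path using args ...

def main(argv=None):
    pipeline_args = sys.argv[1:] if argv is None else argv
    with beam.Pipeline(argv=pipeline_args) as p:
        (p
         | beam.Create(['gs://bucket/video1.avi'])  # stand-in input
         | beam.Map(run_deepmeerkat))

if __name__ == '__main__':
    main()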
This works perfectly locally: tensorflow is turned off (default is on) and training is turned on (default is off). Printing args confirms the behavior. But on Cloud Dataflow the parse goes wrong: tensorflow stays on and training stays off.
DirectRunner:
DeepMeerkat args: Namespace(tensorflow=False, training=True)
From the DataflowRunner worker logs:
DeepMeerkat args: Namespace(tensorflow=True, training=False)
Any ideas what's going on here? Identical command, identical code; the only change is DirectRunner to DataflowRunner.
I'd rather not go down the road of passing custom arguments through pipeline options, since I would then need to assign them somehow downstream. If a tool already parses its own arguments, reusing that parser seems much more straightforward, provided there isn't something special about the Dataflow workers.
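
For completeness, the road I'm trying to avoid would look roughly like this (a sketch using Beam's documented _add_argparse_args hook; CustomOptions is a name I made up, and pipeline_args is the same list as in the sketch above):

from apache_beam.options.pipeline_options import PipelineOptions

class CustomOptions(PipelineOptions):
    # Beam calls this hook, so these flags travel with the pipeline
    # options to the workers instead of being read from sys.argv.
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--tensorflow', action='store_false', default=True)
        parser.add_argument('--training', action='store_true', default=False)

options = PipelineOptions(pipeline_args)
custom = options.view_as(CustomOptions)
# custom.tensorflow / custom.training would then have to be threaded
# into every DoFn by hand, which is the bookkeeping I want to avoid.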