
I was attempting to run a TFX pipeline with BeamDagRunner, using Dataflow both to orchestrate the pipeline and to execute the TFX components. However, the components can't execute: their Dataflow jobs fail saying setup.py was not found. I believe what is happening is that the component Dataflow jobs are passed the Beam pipeline arg --setup_file=/path/to/setup.py, but that path only exists on my local machine, not on the orchestrator's Dataflow worker. Is there a way to pass that into my component pipeline args properly? This works as expected when I orchestrate with a DirectRunner, since setup.py is found on the local path.

Small snippet:

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
from tfx.orchestration import pipeline

BeamDagRunner(
    # Args for the Beam pipeline that runs the TFX DAG itself (the orchestrator).
    beam_orchestrator_args=[
        '--setup_file=./setup.py',
        '--runner=DataflowRunner'
    ]
).run(
    pipeline.Pipeline(
        ...
        # Args passed through to each component's own Beam pipeline.
        beam_pipeline_args=[
            '--setup_file=./setup.py',
            '--runner=DataflowRunner'
        ]
    )
)

This snippet should run the orchestrator on Dataflow and also execute the components on Dataflow. However, the component jobs fail saying setup.py can't be found.
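For reference, one workaround I've been considering (untested, and I'm not sure whether Beam's stager accepts a remote path for extra packages on every version) is to avoid --setup_file for the components entirely: build a source distribution locally, upload it to GCS, and point the component jobs at it with --extra_package, so nothing depends on a path existing on the orchestrator's Dataflow worker. The bucket and package names below are placeholders:

# Untested sketch: ship the dependency as a prebuilt sdist instead of relying
# on --setup_file resolving on the orchestrator's Dataflow worker.
# Assumes the sdist was built locally (python setup.py sdist) and uploaded,
# e.g. to gs://my-bucket/my_package-0.1.tar.gz (both names are placeholders).

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner
from tfx.orchestration import pipeline

BeamDagRunner(
    beam_orchestrator_args=[
        '--setup_file=./setup.py',  # resolved locally at submit time, so OK
        '--runner=DataflowRunner'
    ]
).run(
    pipeline.Pipeline(
        ...
        beam_pipeline_args=[
            # A GCS path does not depend on the orchestrator worker's filesystem.
            '--extra_package=gs://my-bucket/my_package-0.1.tar.gz',
            '--runner=DataflowRunner'
        ]
    )
)

If --extra_package turns out not to accept remote paths on the Beam version in use, the multiple-file-dependencies guide linked in the comments covers the local-tarball variant of the same idea, but I haven't verified either approach on a Dataflow-orchestrated run.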

redwan
  • Can you explain more about your architecture and share a snippet of your code? I can't conclude much from the current description alone. – rmesteves May 15 '20 at 14:26
  • Edited to add in a small snippet. Let me know if you need more details. – redwan May 15 '20 at 21:34
  • Have you tried following the instructions in this guide on multiple file dependencies: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ ? Probably try out the existing example first. – chamikara May 18 '20 at 01:03
  • Where exactly are you running this code? Cloud Composer, Apache Beam? – rmesteves May 21 '20 at 11:14

0 Answers