I try to run a Apache Beam pipeline (Python) within Google Cloud Dataflow, triggered by a DAG in Google Cloud Coomposer.
The structure of my dags folder in the respective GCS bucket is as follows:
/dags/
dataflow.py <- DAG
dataflow/
pipeline.py <- pipeline
setup.py
my_modules/
__init__.py
commons.py <- the module I want to import in the pipeline
The setup.py is very basic, but according to the Apache Beam docs and answers on SO:
import setuptools
setuptools.setup(setuptools.find_packages())
In the DAG file (dataflow.py) I set the setup_file
option and pass it to Dataflow:
default_dag_args = {
... ,
'dataflow_default_options': {
... ,
'runner': 'DataflowRunner',
'setup_file': os.path.join(configuration.get('core', 'dags_folder'), 'dataflow', 'setup.py')
}
}
Within the pipeline file (pipeline.py) I try to use
from my_modules import commons
but this fails. The log in Google Cloud Composer (Apache Airflow) says:
gcp_dataflow_hook.py:132} WARNING - b' File "/home/airflow/gcs/dags/dataflow/dataflow.py", line 11\n from my_modules import commons\n ^\nSyntaxError: invalid syntax'
The basic idea behind the setup.py file is documented here
Also, there are similar questions on SO which helped me:
Google Dataflow - Failed to import custom python modules
Dataflow/apache beam: manage custom module dependencies
I'm actually wondering why my pipelines fails with a Syntax Error
and not a module not found
kind of error...