I am trying to run a DAG in Airflow 2.1.2 that executes a Dataflow job. The Dataflow job reads data from a storage bucket and uploads it to BigQuery. The dataflow_default_options in the DAG defines the region as europe-west1, however the actual job submitted by the DAG overrides it to us-central1. Because of this the Dataflow job fails on the BigQuery upload, since it runs in us-central1.
It worked fine before, when I was using the older version of Airflow (1.10.15). Code below:
DEFAULT_DAG_ARGS = {
    'start_date': YESTERDAY,
    'email': models.Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'project_id': models.Variable.get('gcp_project'),
    'dataflow_default_options': {
        'region': 'europe-west1',
        'project': models.Variable.get('gcp_project'),
        'temp_location': models.Variable.get('gcp_temp_location'),
        'runner': 'DataflowRunner',
        'zone': 'europe-west1-d'
    }
}

with models.DAG(dag_id='GcsToBigQueryTriggered',
                description='A DAG triggered by an external Cloud Function',
                schedule_interval=None,
                default_args=DEFAULT_DAG_ARGS,
                max_active_runs=1) as dag:

    # Args required for the Dataflow job.
    job_args = {
        'input': 'gs://{{ dag_run.conf["bucket"] }}/{{ dag_run.conf["name"] }}',
        'output': models.Variable.get('bq_output_table'),
        'fields': models.Variable.get('input_field_names'),
        'load_dt': DS_TAG
    }

    # Main Dataflow task that will process and load the input delimited file.
    dataflow_task = dataflow_operator.DataFlowPythonOperator(
        task_id="data-ingest-gcs-process-bq",
        py_file=DATAFLOW_FILE,
        options=job_args)
If I change the region in the options of the dataflow_task to europe-west1, the Dataflow job passes, but the task then fails in Airflow with a 404 error because it waits for the JOB_DONE status of the Dataflow job in the wrong region (us-central1).
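For reference, the workaround I tried looks roughly like this (same variables as above, with the region added directly to the task's options):

    # Attempted workaround: pass the region directly in the task options.
    # The Dataflow job then runs in europe-west1 as expected, but the Airflow
    # task still polls us-central1 for the job status and fails with a 404.
    job_args = {
        'input': 'gs://{{ dag_run.conf["bucket"] }}/{{ dag_run.conf["name"] }}',
        'output': models.Variable.get('bq_output_table'),
        'fields': models.Variable.get('input_field_names'),
        'load_dt': DS_TAG,
        'region': 'europe-west1'
    }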
Am I missing something? Any help would be highly appreciated.