
I am trying to run a job in Airflow 2.1.2 which executes a Dataflow job. The Dataflow job reads data from a storage bucket and uploads it to BigQuery. The `dataflow_default_options` in the DAG define the region as europe-west1, but the actual task in the DAG overrides it to us-central1. Because of this, the Dataflow job fails on the BigQuery upload, since it runs in us-central1.

It was working fine before, when I was using the older version of Airflow (1.10.15). Code below:

from airflow import models
from airflow.contrib.operators import dataflow_operator

# YESTERDAY, DS_TAG and DATAFLOW_FILE are defined earlier in the DAG file (not shown).
DEFAULT_DAG_ARGS = {
    'start_date': YESTERDAY,
    'email': models.Variable.get('email'),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'project_id': models.Variable.get('gcp_project'),
    'dataflow_default_options': {
        'region': 'europe-west1',
        'project': models.Variable.get('gcp_project'),
        'temp_location': models.Variable.get('gcp_temp_location'),
        'runner': 'DataflowRunner',
        'zone': 'europe-west1-d'
    }
}

with models.DAG(dag_id='GcsToBigQueryTriggered',
                description='A DAG triggered by an external Cloud Function',
                schedule_interval=None,
                default_args=DEFAULT_DAG_ARGS,
                max_active_runs=1) as dag:
    # Args required for the Dataflow job.
    job_args = {
        'input': 'gs://{{ dag_run.conf["bucket"] }}/{{ dag_run.conf["name"] }}',
        'output': models.Variable.get('bq_output_table'),
        'fields': models.Variable.get('input_field_names'),
        'load_dt': DS_TAG
    }

    # Main Dataflow task that will process and load the input delimited file.
    dataflow_task = dataflow_operator.DataFlowPythonOperator(
        task_id="data-ingest-gcs-process-bq",
        py_file=DATAFLOW_FILE,
        options=job_args)

If I change the region in the options of the dataflow_task to europe-west1, the Dataflow job itself succeeds, but the Airflow task then fails with a 404 error because it waits for the JOB_DONE status of the Dataflow job in the wrong region (us-central1).
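Concretely, the change I tried was just adding the region to the task options (a rough sketch, the rest of the DAG is unchanged):

    job_args = {
        'input': 'gs://{{ dag_run.conf["bucket"] }}/{{ dag_run.conf["name"] }}',
        'output': models.Variable.get('bq_output_table'),
        'fields': models.Variable.get('input_field_names'),
        'load_dt': DS_TAG,
        # With this the Dataflow job runs in europe-west1, but the Airflow task
        # still polls us-central1 and fails with a 404.
        'region': 'europe-west1'
    }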

Am I missing something? Any help would be highly appreciated.

  • Which Apache Beam SDK version are you using? This used to be [an issue](https://github.com/apache/airflow/issues/8630) in the past with Beam 2.20 – itroulli Oct 07 '21 at 08:09
  • I have deployed Apache Airflow in Google Cloud Composer. How can I check the SDK version? I am using the client libraries like this: `from airflow.contrib.operators import dataflow_operator` – ankie Oct 07 '21 at 09:09
  • @itroulli I think I found it in the Dataflow job – SDK version: Apache Beam Python 3.8 SDK 2.31.0 – ankie Oct 07 '21 at 09:19
  • @itroulli I had a look at the issue above, I think it is coming from here: `def _set_variables(variables): if variables['project'] is None: raise Exception('Project not specified') if 'region' not in variables.keys(): variables['region'] = DEFAULT_DATAFLOW_LOCATION return variables` – ankie Oct 07 '21 at 09:26
  • Since you moved to Airflow 2, it's better to use the new operators `DataflowCreatePythonJobOperator` or `BeamRunPythonPipelineOperator` which are located in `airflow.providers.google.cloud.operators.dataflow` and `airflow.providers.apache.beam.operators.beam` respectively. – itroulli Oct 07 '21 at 11:51
  • I changed the parameter to `location` and it worked (see the sketch after these comments) – ankie Oct 20 '21 at 16:10
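A minimal sketch of the fix described in the last comment, assuming the provider operator suggested by @itroulli and that the `location` parameter is what was changed (the operator's `location` defaults to us-central1, so it has to be set explicitly; the exact value here is taken from the question's desired region):

from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

# Same task as above, but with the regional endpoint set explicitly on the operator,
# so both the Dataflow job and the Airflow status polling use europe-west1.
dataflow_task = DataflowCreatePythonJobOperator(
    task_id="data-ingest-gcs-process-bq",
    py_file=DATAFLOW_FILE,
    options=job_args,
    location='europe-west1')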

0 Answers