
I have a DAG with a step read_date_information_file that reads a file and returns a list of queries (which I can access from its output). I then want to loop through this list and execute each query on Athena using AWSAthenaOperator.

import boto3
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator

def get_date_information(ti):
    # output_bucket and key are defined elsewhere in the module
    s3 = boto3.client('s3')
    data = s3.get_object(Bucket=output_bucket, Key=key)
    contents = data['Body'].read().decode("utf-8")
    print('Date information is: ', contents)
    events_list = contents.split(',')
    return events_list

with DAG(
    dag_id='adserver_split_job_emr_job_dag',
    default_args={
        'owner': 'adserver_airflow',
        'depends_on_past': False,
        'email': ['airflow@example.com'],
        'email_on_failure': False,
        'email_on_retry': False,
    },
    dagrun_timeout=timedelta(hours=2),
    start_date=datetime(2021, 9, 22, 9),
    schedule_interval='20 * * * *',
) as dag:

    read_date_information_file = PythonOperator(
        task_id="read_date_information_file",
        python_callable=get_date_information
    )

    query_list = read_date_information_file.output

    for i, event in enumerate(query_list):
        run_query = AWSAthenaOperator(
            task_id=f'run_query_{i}',
            query=event,
            output_location=config.ATHENA_OUTPUT_LOCATION,
            database=config.ATHENA_DATABASE_NAME,
            aws_conn_id='aws_default'
        )

    read_date_information_file >> run_query

I get an error like this:

Broken DAG: [/opt/airflow/dags/test.py] Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 593, in __setattr__
    super().__setattr__(key, value)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /opt/airflow/dags/test.py after 30.0s, PID: 10056

But if I set query_list to a hard-coded list, it works fine, for example:

query_list = ["SELECT 1;", "SELECT 2;", "SELECT 3;"]

Any help in this regard would be appreciated. I am following this looping approach from the solution mentioned here; the difference is that I am looping over a dynamic list, which is the output of the previous step.

seou1
  • Your code is looping over XComs, which is very different from what is suggested in the answer you shared. Creating tasks based on XComs is bad practice. As for the error, it's hard to assist because it seems related to the timeout you defined: either your tasks need more than 2 hours to complete or something is wrong with the functions you execute (again, you did not share the function code, so we can't tell). – Elad Kalif Sep 22 '21 at 19:02
  • def get_date_information(): data = s3.get_object(Bucket=output_bucket, Key=key) contents = data['Body'].read().decode("utf-8") events_list = contents.split(',') return events_list – seou1 Sep 22 '21 at 19:36
  • Then what is the way to loop over the output of a step? – seou1 Sep 22 '21 at 19:38
  • I have edited the question. Maybe this can help more. – seou1 Sep 22 '21 at 20:23

1 Answer


The issue you are facing is not directly related to Athena; it's more a case of incorrect usage of Airflow.

You are experiencing this issue because Airflow was unable to import your DAG into the DagBag before the import timeout expired (source code).

This happens because you are making expensive calls to the metadata database by trying to create tasks from XComs. Airflow parses your DAG every 30 seconds (the default value of min_file_process_interval), which means that every 30 seconds you open a connection to the database. This is bad practice and you should not do it! It can easily overwhelm your database.

If you still wish to continue down this dangerous path, you need to increase the default of dagbag_import_timeout in airflow.cfg (see source code).
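
For reference, a minimal sketch of that change in airflow.cfg (the 120-second value is only an illustration, not a recommendation):

[core]
# Seconds allowed for importing a single DAG file before the DagBag gives up.
dagbag_import_timeout = 120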

My recommendation: Do not try to dynamically create tasks based on Xcoms.

As for your specific use case: you did not mention exactly what you are trying to solve. Based on the function name (read_date_information_file), I would assume you are trying to run an Athena query for a specific date, and that this date changes whenever the file is updated. Maybe you should just integrate the date directly into your query using Jinja, meaning that the query you pass as query= will reference the XCom directly, something like:

SELECT ...
FROM ...
WHERE something={{ ti.xcom_pull(task_ids='read_date_information_file') }}

That way all you need is a single, static AWSAthenaOperator, but the query it runs changes dynamically based on the XCom value pushed by the read_date_information_file task.
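
Putting it together, a minimal sketch of that single static operator (the SELECT/FROM/WHERE fragments and the config.* values are placeholders carried over from the question):

run_query = AWSAthenaOperator(
    task_id='run_query',
    # 'query' is a templated field, so the XCom reference is resolved at
    # task run time rather than at DAG-parsing time.
    query=(
        "SELECT ... FROM ... "
        "WHERE something={{ ti.xcom_pull(task_ids='read_date_information_file') }}"
    ),
    output_location=config.ATHENA_OUTPUT_LOCATION,
    database=config.ATHENA_DATABASE_NAME,
    aws_conn_id='aws_default',
)

read_date_information_file >> run_query

The DAG file parses quickly because no XCom is read at parse time; the templated query is rendered only when the task actually runs.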

Elad Kalif
  • The file that is read can have multiple queries (1, 2, 3, etc.), so I need to execute those queries via the Athena operator. For this purpose I am using the for loop to enumerate over the list of queries and execute them one by one. – seou1 Sep 22 '21 at 21:33
  • @seou1 this is not the way to solve this problem. You should create a custom operator that can handle a list of queries. There is an open feature request for that (https://github.com/apache/airflow/issues/8449); you can use the example code in the issue as a starting point. – Elad Kalif Sep 22 '21 at 23:30
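
As an illustration only, here is a rough sketch of such a custom operator (this is not the code from the linked issue; the class name and constructor arguments are made up, while run_query and poll_query_status are methods of the Amazon provider's AWSAthenaHook):

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.athena import AWSAthenaHook


class AthenaMultiQueryOperator(BaseOperator):
    """Run a list of Athena queries sequentially inside a single task."""

    template_fields = ('queries',)

    def __init__(self, *, queries, database, output_location,
                 aws_conn_id='aws_default', **kwargs):
        super().__init__(**kwargs)
        self.queries = queries
        self.database = database
        self.output_location = output_location
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        hook = AWSAthenaHook(aws_conn_id=self.aws_conn_id)
        for query in self.queries:
            query_id = hook.run_query(
                query=query,
                query_context={'Database': self.database},
                result_configuration={'OutputLocation': self.output_location},
            )
            # Block until the query reaches a terminal state.
            final_state = hook.poll_query_status(query_id)
            if final_state != 'SUCCEEDED':
                raise AirflowException(
                    f'Query {query_id} finished in state {final_state}'
                )

Because queries is a templated field, with render_template_as_native_obj=True set on the DAG (Airflow 2.1+) it can be fed from the upstream task's XCom and still arrive as a Python list, so the task itself stays static while the queries it runs change per run.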