
I would like to trigger an Airflow DAG based on SQS messages. I am quite new to Airflow, but this is how I think it could be done:

Option 1

Use the Airflow SQS Sensor. From my understanding, this waits for SQS messages before proceeding with the execution of an already triggered DAG. Does this mean a DAG run would always need to be active and waiting, so that it catches any new messages and processes them? Does this also mean I should schedule my DAG on a very short interval so that, once one DAG run has handled an SQS message, another run is created to handle the next messages?

Option 2

Add a Lambda function (or something similar) that watches for SQS messages and uses the Airflow API to trigger DAGs when needed.
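For reference, a minimal sketch of what Option 2 could look like, assuming Airflow 2.x with the stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`) and basic auth enabled. The base URL, DAG id, and credentials are hypothetical placeholders, and the handler assumes the Lambda is wired to the queue as an SQS event source:

```python
import base64
import json
import urllib.request

# Hypothetical values, not taken from any real deployment.
AIRFLOW_URL = "https://airflow.example.com"
DAG_ID = "process_sqs_message"


def build_trigger_request(base_url, dag_id, conf, user, password):
    """Build (without sending) a POST that asks Airflow to create a DAG run."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": conf}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def lambda_handler(event, context):
    """Lambda entry point: trigger one DAG run per SQS record in the event."""
    for record in event["Records"]:
        request = build_trigger_request(
            AIRFLOW_URL,
            DAG_ID,
            {"sqs_body": record["body"]},  # passed to the DAG run as its conf
            "airflow_user",       # placeholder; load real credentials from a secret store
            "airflow_password",   # placeholder
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            response.read()
```

With this approach the DAG itself needs no sensor and can have no schedule at all, at the cost of one extra component (the Lambda) to deploy and monitor.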

Ultimately, I would like to minimise the number of interactions needed to trigger a DAG, so I would prefer to use an Airflow built-in way of watching SQS.

Thank you

ypicard

1 Answer


Both options are valid; however, Option 2 is basically an alternative implementation of a sensor. I think the better solution is Option 1 with some modification:

Use SQSSensor, but with mode='reschedule'. That way the sensor "wakes up" every once in a while and checks whether the criteria are met. Note that this is not like sleep(x): when the criteria aren't met, Airflow releases the worker slot for other tasks that need to run and returns the SQSSensor to the scheduling queue. You can read more about the sensor modes in the docs.

from airflow.providers.amazon.aws.sensors.sqs import SQSSensor
SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule')

Note that the sensor will keep waiting until the criteria are met or its timeout expires. You can set timeout on the sensor task explicitly (there are other possible reasons for a timeout, like cluster policy and other defaults, but that is another topic).
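To make the timeout and polling settings concrete, here is a minimal sketch of a whole DAG, assuming Airflow 2.x with the Amazon provider installed. The DAG id, the 5-minute schedule, the 60-second poke interval, and the 6-hour timeout are illustrative assumptions, not recommendations. Airflow is imported inside the function only so the settings can be inspected without Airflow installed; in a real DAG file you would call `build_dag()` at module level so the scheduler can discover the DAG.

```python
from datetime import datetime, timedelta

# Sensor settings; all values below are illustrative assumptions.
sensor_args = {
    "task_id": "wait_for_sqs_message",
    "sqs_queue": "your_queue",       # placeholder from the answer; the sensor expects the queue URL
    "aws_conn_id": "aws_default",
    "mode": "reschedule",            # free the worker slot between checks
    "poke_interval": 60,             # seconds between checks of the queue
    "timeout": int(timedelta(hours=6).total_seconds()),  # fail the task after 6 hours of waiting
}


def build_dag():
    """Create the DAG; Airflow imports are local so this module loads without Airflow."""
    from airflow import DAG
    from airflow.providers.amazon.aws.sensors.sqs import SQSSensor

    with DAG(
        dag_id="sqs_triggered_dag",              # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval=timedelta(minutes=5),  # how often a new run is created
        max_active_runs=1,                       # avoid piling up waiting runs
        catchup=False,
    ) as dag:
        SQSSensor(**sensor_args)
        # Downstream tasks that process the received messages would go here.
    return dag
```

The class name follows the import used above; I believe newer releases of the Amazon provider expose it as `SqsSensor`, so check the name against your installed version.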

Elad Kalif
    How would the sensor option handle a high peak load of incoming SQS messages? Will it be limited by the frequency it is scheduled at? Meaning, if it is scheduled to run every second and 2 messages arrive every second, would it never be able to drain the queue completely? – ypicard May 28 '21 at 12:22
    You are confusing the DAG's `schedule_interval` with sensor poking. Once the DAG is running, the `schedule_interval` is irrelevant. You can set the sensor's `poke_interval` according to your needs. – Elad Kalif May 28 '21 at 19:03