
I've been noticing that some of the DAG runs for an hourly DAG are being skipped. I checked the log for the DAG run just before the skipping started and saw that it had actually been running for 7 hours, which is why the subsequent DAG runs didn't happen. This is very strange, since the DAG usually takes only about 30 minutes to finish.

We're using Airflow version 2.0.2

This is what I saw in the logs:

[2022-05-06 13:26:56,668] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:26:56,806] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:01,860] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:01,872] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:06,960] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:07,019] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:12,224] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:12,314] {taskinstance.py:630} DEBUG - Refreshed TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]>
[2022-05-06 13:27:17,368] {taskinstance.py:595} DEBUG - Refreshing TaskInstance <TaskInstance: dfp_hourly.revequery 2022-05-05T13:00:00+00:00 [running]> from DB
[2022-05-06 13:27:17,377] {taskinstance.py:630} DEBUG - Refreshed TaskInstance 
KristiLuna

1 Answer


I think you are running too many tasks in parallel, which is why they run for hours. This can be fixed by using pools. Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving each pool a name and assigning it a number of worker slots.

Tasks can then be associated with one of the existing pools by using the pool parameter when creating tasks:

from datetime import timedelta

from airflow.operators.bash import BashOperator

aggregate_db_message_job = BashOperator(
    task_id="aggregate_db_message_job",
    execution_timeout=timedelta(hours=3),   # fail the task if it runs longer than 3 hours
    pool="ep_data_pipeline_db_msg_agg",     # run within this pool's slot limit
    bash_command=aggregate_db_message_job_cmd,
    dag=dag,
)

aggregate_db_message_job.set_upstream(wait_for_empty_queue)

Tasks will be scheduled as usual while the slots fill up. The number of slots occupied by a task can be configured with pool_slots. Once capacity is reached, runnable tasks get queued and their state will show as such in the UI. As slots free up, queued tasks start running based on the Priority Weights of the task and its descendants.
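For example, a particularly heavy task can be made to occupy more than one slot via pool_slots. A minimal sketch, assuming the same pool as above exists and a dag object is defined (the task name and command here are hypothetical):

from datetime import timedelta

from airflow.operators.bash import BashOperator

heavy_report_job = BashOperator(
    task_id="heavy_report_job",              # hypothetical task
    pool="ep_data_pipeline_db_msg_agg",      # same pool as above
    pool_slots=2,                            # occupies 2 of the pool's slots while it runs
    execution_timeout=timedelta(hours=1),    # fail the task if it exceeds 1 hour
    bash_command="echo run heavy report",    # placeholder command
    dag=dag,
)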

Note that if tasks are not given a pool, they are assigned to the default pool, default_pool, which is initialized with 128 slots and can be modified through the UI or CLI (but cannot be removed).
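For reference, pools can also be inspected and resized from the command line with the airflow pools subcommands (a rough sketch; the exact output depends on your Airflow version):

airflow pools list                                    # show existing pools and their slot counts
airflow pools get default_pool                        # show just the default pool
airflow pools set default_pool 200 "Default pool"     # resize the default pool to 200 slots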

Mobeen
  • Thanks for responding. I've assigned pool slots to other tasks, and can add them to this task as well. Is there anything adverse about creating a new pool with another 128 slots and using that? – KristiLuna May 09 '22 at 13:43
  • 128 slots is just the default; you can alter it to fit your needs or your specific set of tasks – Mobeen May 10 '22 at 06:36
  • Are there any negative implications to increasing the default slots to 200? – KristiLuna May 11 '22 at 18:06
  • You don't have to; you can create new pools for your specific needs. Airflow sets 128 as the default, but you can always create a separate pool for tasks that take a long time – Mobeen May 12 '22 at 06:30