
I am in the process of migrating our Airflow environment from version 1.10.15 to 2.3.3. I have migrated one DAG over to the new environment, and intermittently I get an email with this error: "Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?"

This is what I find in the scheduler logs:

[2022-08-09 07:00:08,621] {dag.py:2968} INFO - Setting next_dagrun for DAGRP-Get_Overrides to 2022-08-09T11:00:00+00:00, run_after=2022-08-09T16:00:00+00:00
[2022-08-09 07:00:08,652] {scheduler_job.py:353} INFO - 1 tasks up for execution:
        <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,652] {scheduler_job.py:418} INFO - DAG DAGRP-Get_Overrides has 0/3 running and queued tasks
[2022-08-09 07:00:08,652] {scheduler_job.py:504} INFO - Setting the following tasks to queued state:
        <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [scheduled]>
[2022-08-09 07:00:08,654] {scheduler_job.py:546} INFO - Sending TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1) to executor with priority 1 and queue default
[2022-08-09 07:00:08,654] {base_executor.py:91} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'DAGRP-Get_Overrides', 'Get_override', 'scheduled__2022-08-08T16:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/da_group/get_override.py']
[2022-08-09 07:00:12,665] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:12,667] {celery_executor.py:283} INFO - [Try 1 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:16,701] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:16,702] {celery_executor.py:283} INFO - [Try 2 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:21,704] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:21,705] {celery_executor.py:283} INFO - [Try 3 of 3] Task Timeout Error for Task: (TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)).
[2022-08-09 07:00:26,627] {timeout.py:67} ERROR - Process timed out, PID: 1
[2022-08-09 07:00:26,627] {celery_executor.py:294} ERROR - Error sending Celery task: Timeout, PID: 1
Celery Task ID: TaskInstanceKey(dag_id='DAGRP-Get_Overrides', task_id='Get_override', run_id='scheduled__2022-08-08T16:00:00+00:00', try_number=1, map_index=-1)
Traceback (most recent call last):
  File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 30, in __call__
    return self.__value__
AttributeError: 'ChannelPromise' object has no attribute '__value__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/airflow/lib/python3.8/site-packages/airflow/executors/celery_executor.py", line 177, in send_task_to_executor
    result = task_to_run.apply_async(args=[command], queue=queue)
  File "/opt/airflow/lib/python3.8/site-packages/celery/app/task.py", line 575, in apply_async
    return app.send_task(
  File "/opt/airflow/lib/python3.8/site-packages/celery/app/base.py", line 788, in send_task
    amqp.send_task_message(P, name, message, **options)
  File "/opt/airflow/lib/python3.8/site-packages/celery/app/amqp.py", line 510, in send_task_message
    ret = producer.publish(
  File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 177, in publish
    return _publish(
  File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 523, in _ensured
    return fun(*args, **kwargs)
  File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 186, in _publish
    channel = self.channel
  File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 209, in _get_channel
    channel = self._channel = channel()
  File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 32, in __call__
    value = self.__value__ = self.__contract__()
  File "/opt/airflow/lib/python3.8/site-packages/kombu/messaging.py", line 225, in <lambda>
    channel = ChannelPromise(lambda: connection.default_channel)
  File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 895, in default_channel
    self._ensure_connection(**conn_opts)
  File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 433, in _ensure_connection
    return retry_over_time(
  File "/opt/airflow/lib/python3.8/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
    return fun(*args, **kwargs)
  File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 877, in _connection_factory
    self._connection = self._establish_connection()
  File "/opt/airflow/lib/python3.8/site-packages/kombu/connection.py", line 812, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/airflow/lib/python3.8/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
    conn.connect()
  File "/opt/airflow/lib/python3.8/site-packages/amqp/connection.py", line 323, in connect
    self.transport.connect()
  File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 129, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/opt/airflow/lib/python3.8/site-packages/amqp/transport.py", line 184, in _connect
    self.sock.connect(sa)
  File "/opt/airflow/lib/python3.8/site-packages/airflow/utils/timeout.py", line 68, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1


[2022-08-09 07:00:26,627] {scheduler_job.py:599} INFO - Executor reports execution of DAGRP-Get_Overrides.Get_override run_id=scheduled__2022-08-08T16:00:00+00:00 exited with status failed for try_number 1
[2022-08-09 07:00:26,633] {scheduler_job.py:642} INFO - TaskInstance Finished: dag_id=DAGRP-Get_Overrides, task_id=Get_override, run_id=scheduled__2022-08-08T16:00:00+00:00, map_index=-1, run_start_date=None, run_end_date=None, run_duration=None, state=queued, executor_state=failed, try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=default, priority_weight=1, operator=PythonOperator, queued_dttm=2022-08-09 11:00:08.652767+00:00, queued_by_job_id=56, pid=None
[2022-08-09 07:00:26,633] {scheduler_job.py:684} ERROR - Executor reports task instance <TaskInstance: DAGRP-Get_Overrides.Get_override scheduled__2022-08-08T16:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
[2022-08-09 07:01:16,687] {processor.py:233} WARNING - Killing DAGFileProcessorProcess (PID=1811)
[2022-08-09 07:04:00,640] {scheduler_job.py:1233} INFO - Resetting orphaned tasks for active dag runs

I am running Airflow on 2 servers with 2 of each service (2 schedulers, 2 workers, 2 webservers), all in Docker containers. They are configured to use the Celery executor, and I'm using RabbitMQ 3.10.6 (also 2 instances in Docker containers behind a load balancer). The database is Postgres 13.7 (a single instance in a Docker container on the 1st server). The environment is running on Python 3.8.12.

From my understanding, the timeout is between the scheduler and RabbitMQ, and from what I can tell we are hitting this timeout: AIRFLOW__CELERY__OPERATION_TIMEOUT (it's currently set to 4).
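
For reference, that variable maps to operation_timeout under the [celery] section of airflow.cfg. A rough sketch of how it is set in our containers (the value 4 is what we currently have configured):

AIRFLOW__CELERY__OPERATION_TIMEOUT=4

# equivalent airflow.cfg entry:
# [celery]
# operation_timeout = 4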

I would like to track down what is causing the issue before I just increase timeout settings. What can I do to find out what's going on? Has anyone else run into this issue? Am I correct in assuming the timeout is between the scheduler and RabbitMQ? Or is it between the scheduler and the database? Why am I seeing this with Airflow 2 when the same setup works with no problems on Airflow 1? Any help is greatly appreciated!

Update: I was able to reproduce the error by shutting down one of the RabbitMQ nodes. Even though RabbitMQ is behind a load balancer with a health probe, whenever a job was picked up by scheduler 1 it would fail with this error, but if scheduler 2 picked up the job it would finish successfully. The odd thing is that I shut down RabbitMQ node 2.


1 Answer


So I think I've been able to solve this issue. Here is what I did:

I added a custom celery_config.py to the scheduler and worker Docker containers and set this environment variable: AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=celery_config.CELERY_CONFIG. As part of that Celery config, I specified both of my RabbitMQ brokers under broker_url. This is the full config:

from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
import os

# Broker credentials and hostnames come from the container environment
RABBITMQ_PW = os.environ["RABBITMQ_PW"]
CLUSTER_NODE = os.environ["RABBITMQ_CLUSTER_NODE"]
LOCAL_NODE = os.environ["RABBITMQ_NODE"]


# Start from Airflow's default Celery config and override only what we need
CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "worker_send_task_events": True,
    "task_send_sent_event": True,
    "result_extended": True,
    # List both brokers so Celery can fail over if the first one is unreachable
    "broker_url": [
        f'amqp://rabbitmq:{RABBITMQ_PW}@{LOCAL_NODE}:5672',
        f'amqp://rabbitmq:{RABBITMQ_PW}@{CLUSTER_NODE}:5672'
    ]
}

Now, if the worker loses its connection to the 1st broker, it will attempt to connect to the 2nd broker:

[2022-08-11 12:00:52,876: ERROR/MainProcess] consumer: Cannot connect to amqp://rabbitmq:**@<LOCAL_NODE>:5672//: [Errno 111] Connection refused.
[2022-08-11 12:00:52,875: INFO/MainProcess] Connected to amqp://rabbitmq:**@<CLUSTER_NODE>:5672//

Also an interesting note: I still have the Airflow environment variable AIRFLOW__CELERY__BROKER_URL set to the load balancer URL. That's because Airflow (1) won't allow the worker to start without it and (2) won't allow you to specify multiple brokers the way the Celery config does. So when the worker starts, it shows:

- ** ---------- .> transport:   amqp://rabbitmq:**@<LOCAL_NODE>:5672//
[2022-08-26 11:37:17,952: INFO/MainProcess] Connected to amqp://rabbitmq:**@<LOCAL_NODE>:5672//

It connects to the local node even though I have the LB configured as the AIRFLOW__CELERY__BROKER_URL; the broker_url list from the Celery config evidently takes precedence.
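
For completeness, this is roughly what the relevant container environment looks like (hostnames and the password are placeholders, not the real values; celery_config.py just needs to be importable from the containers' PYTHONPATH):

AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=celery_config.CELERY_CONFIG
AIRFLOW__CELERY__BROKER_URL=amqp://rabbitmq:<RABBITMQ_PW>@<LB_HOSTNAME>:5672   # still points at the LB
RABBITMQ_PW=<password>
RABBITMQ_NODE=<LOCAL_NODE hostname>
RABBITMQ_CLUSTER_NODE=<CLUSTER_NODE hostname>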
