
I am running Airflow 2.1.4 with the KubernetesExecutor on Kubernetes 1.21.0.

I am getting AirflowException: Task received SIGTERM signal

Stack trace:

[2021-10-11 06:22:52,543] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 17
[2021-10-11 06:22:52,543] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-10-11 06:22:52,560] {taskinstance.py:1463} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1165, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1283, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1313, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 150, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 161, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/a.py", line 12, in test
    time.sleep(5)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1238, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal

When the task's heartbeat callback runs (every job_heartbeat_sec, per the Airflow configuration), the recorded PID and the current PID do not match because this part of the code is executed: https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L201-L203

if ti.run_as_user or self.task_runner.run_as_user:
    recorded_pid = psutil.Process(ti.pid).ppid()
    same_process = recorded_pid == current_pid
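
To illustrate why the ppid() branch makes that comparison fail, here is a standalone sketch of the check (recorded_task_pid stands in for ti.pid and run_as_user for ti.run_as_user / self.task_runner.run_as_user; the values are placeholders, not Airflow internals):

import os
import psutil

recorded_task_pid = os.getpid()   # stand-in for ti.pid (the PID stored on the task instance)
current_pid = os.getpid()         # stand-in for the PID the heartbeat callback sees
run_as_user = "airflow"           # any truthy value takes the ppid() branch

if run_as_user:
    # With run_as_user set, Airflow compares the *parent* of the recorded PID,
    # expecting the task to run one process level deeper (e.g. via sudo).
    recorded_pid = psutil.Process(recorded_task_pid).ppid()
else:
    recorded_pid = recorded_task_pid

same_process = recorded_pid == current_pid
print(recorded_pid, current_pid, same_process)  # a mismatch here is what triggers the SIGTERM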

I tried printing the values of run_as_user there by patching the source code; they came out as:

ti.run_as_user = None
self.task_runner.run_as_user = airflow

I have tried changing the job_heartbeat_sec value, and the task always fails at that point. I am running the scheduler and webserver as the airflow user, and the DAGs are also triggered by the airflow user. I have tried setting run_as_user in the DAG definition to both None and airflow (see the sketch below), but I get the same error.
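
For completeness, this is roughly how run_as_user was set in the DAG definition (a minimal sketch; the DAG id and schedule are placeholders, and the callable mirrors the time.sleep(5) from the traceback):

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def test():
    time.sleep(5)


with DAG(
    dag_id="sigterm_repro",            # placeholder DAG id
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="test",
        python_callable=test,
        run_as_user="airflow",         # also tried run_as_user=None; same SIGTERM either way
    )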

Airflow user creation from the Dockerfile:

RUN chown -R airflow: /usr/local/bin/airflow
RUN chown -R airflow: /root
USER airflow
WORKDIR ${AIRFLOW_HOME}

Airflow Configuration

[scheduler]
job_heartbeat_sec = 30
clean_tis_without_dagrun_interval = 15.0
scheduler_heartbeat_sec = 10
num_runs = -1
processor_poll_interval = 1
min_file_process_interval = 0
dag_dir_list_interval = 300
print_stats_interval = 30
pool_metrics_interval = 5.0
scheduler_health_check_threshold = 295
orphaned_tasks_check_interval = 305.0
child_process_log_directory = /root/airflow/logs/scheduler
scheduler_zombie_task_threshold = 300
catchup_by_default = True
max_tis_per_query = 512
use_row_level_locking = True
max_dagruns_to_create_per_loop = 10
max_dagruns_per_loop_to_schedule = 20
schedule_after_task_execution = False
parsing_processes = 2
file_parsing_sort_mode = modified_time
use_job_schedule = True
allow_trigger_in_future = False
dependency_detector = airflow.serialization.serialized_objects.DependencyDetector
run_duration = -1
max_threads = 2
authenticate = False


[kubernetes]
pod_template_file = /usr/local/airflow/pod_templates/pod_template_file.yaml
worker_container_repository = ************************
worker_container_tag = ****************************
namespace = default
delete_worker_pods = False
delete_worker_pods_on_failure = False
worker_pods_creation_batch_size = 1
multi_namespace_mode = False
in_cluster = True
kube_client_request_args =
delete_option_kwargs =
enable_tcp_keepalive = False
tcp_keep_idle = 120
tcp_keep_intvl = 30
tcp_keep_cnt = 6
verify_ssl = True
worker_pods_pending_timeout = 300
worker_pods_pending_timeout_check_interval = 120
worker_pods_pending_timeout_batch_size = 100
airflow_configmap = airflow-configmap
airflow_local_settings_configmap = airflow-configmap

2 Answers


A bug fix should be coming in Airflow 2.2.3.

Eugene

One reason this can happen in any version of Airflow is setting retries on the DAG. These DAG arguments at creation:

'retries': 1,
'retry_delay': timedelta(minutes=5)

mean that when the first task instance has run for 5 minutes without completing, Airflow launches the next instance (Airflow calls these task attempts).

In my case, I was running a Docker container inside the DAG task. The container was a long-running job which was failing after 5 minutes (the retry_delay) with this error:

[2023-06-02, 13:12:43 UTC] {local_task_job_runner.py:271} WARNING - Recorded pid 2038103 does not match the current pid 2036364
[2023-06-02, 13:12:43 UTC] {process_utils.py:131} INFO - Sending Signals.SIGTERM to group 2036364. PIDs of all processes in the group: [2036364]
[2023-06-02, 13:12:43 UTC] {process_utils.py:86} INFO - Sending the signal Signals.SIGTERM to group 2036364
[2023-06-02, 13:12:43 UTC] {taskinstance.py:1517} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-06-02, 13:12:43 UTC] {docker.py:495} INFO - Stopping docker container
[2023-06-02, 13:12:46 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/airflow/providers/docker/operators/docker.py", line 478, in execute
    return self._run_image()
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/airflow/providers/docker/operators/docker.py", line 339, in _run_image
    return self._run_image_with_mounts(self.mounts + [tmp_mount], add_tmp_variable=True)
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/airflow/providers/docker/operators/docker.py", line 400, in _run_image_with_mounts
    for log_chunk in logstream:
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/types/daemon.py", line 29, in __next__
    return next(self._stream)
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/api/client.py", line 422, in <genexpr>
    gen = (data for (_, data) in gen)
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/utils/socket.py", line 113, in frames_iter_no_tty
    (stream, n) = next_frame_header(socket)
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/utils/socket.py", line 85, in next_frame_header
    data = read_exactly(socket, 8)
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/utils/socket.py", line 70, in read_exactly
    next_data = read(socket, n - len(data))
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/docker/utils/socket.py", line 41, in read
    poll.poll()
  File "/home/ubuntu/work-nisha/airflow-poc/venv/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1519, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2023-06-02, 13:12:46 UTC] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=s000_docker1, task_id=docker_command, execution_date=20230602T130758, start_date=20230602T131226, end_date=20230602T131246
[2023-06-02, 13:12:46 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 613 for task docker_command (Task received SIGTERM signal; 2036364)

Comparing the task instance PIDs for the first and second attempts, they matched the PIDs in the log line Recorded pid 2038103 does not match the current pid 2036364.

The first attempt's PID was 2036364 and the second attempt's PID was 2038103.

This next attempt caused the first attempt's long-running process to fail, and since Airflow kills all the processes in the group, the second attempt failed too.

See the Airflow code: https://github.com/apache/airflow/blob/dc5bf3fd02c589578209cb0dd5b7d005b1516ae9/airflow/jobs/local_task_job_runner.py#L270

Solution:

If it is a long-running task, do not add retries to the DAG, or set retry_delay to a value within which the task instance is expected to complete (see the sketch below).
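
A minimal sketch of that change (the dag_id matches the log above; the delay is a placeholder to adjust to your task's expected runtime):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    # Option 1: no retries at all for the long-running task
    # "retries": 0,
    # Option 2: keep the retry, but make the delay longer than the task's expected runtime
    "retries": 1,
    "retry_delay": timedelta(hours=2),   # placeholder value
}

with DAG(
    dag_id="s000_docker1",
    start_date=datetime(2023, 6, 1),
    schedule=None,
    catchup=False,
    default_args=default_args,
) as dag:
    ...                                  # the long-running DockerOperator task goes here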

Nisha Sinha