I have a long-running Cloud Composer Airflow task that kicks off a job using the KubernetesPodOperator. Sometimes it finishes successfully after about two hours, but more often it gets marked as failed with the following error in the Airflow worker log:
[2019-06-24 18:49:34,718] {jobs.py:2685} WARNING - The recorded hostname airflow-worker-xxxxxxxxxx-aaaaa does not match this instance's hostname airflow-worker-xxxxxxxxxx-bbbbb
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
...
File "/usr/local/lib/airflow/airflow/jobs.py", line 2686, in heartbeat_callback
raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
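From the traceback, the check that fails appears to boil down to a simple hostname comparison in the job's heartbeat. A minimal sketch of what jobs.py:2686 seems to be doing (simplified paraphrase, not the actual Airflow source; the stand-in exception class and parameter names are mine):

```python
import socket


class AirflowException(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


def heartbeat_callback(recorded_hostname, current_hostname=None):
    # Simplified paraphrase of the check in the traceback: on each heartbeat,
    # the job compares the hostname recorded for the running task instance
    # against the hostname of the worker performing the heartbeat, and fails
    # the task on any mismatch.
    current = current_hostname or socket.getfqdn()
    if recorded_hostname != current:
        raise AirflowException("Hostname of job runner does not match")
```

In other words, the task is failed purely because the heartbeat is being processed on a worker other than the one recorded at task start, regardless of whether the underlying job is still healthy.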
After the task is marked as failed, the actual KubernetesPodOperator job still finishes successfully without any errors. Both of the workers referenced in the log, airflow-worker-xxxxxxxxxx-aaaaa and airflow-worker-xxxxxxxxxx-bbbbb, are still up and running.
This Airflow PR made it possible to override the hostname, but I can't tell whether that's an appropriate solution here, since none of the workers appear to have died or changed during the task run. Is it normal for a running task to be reassigned to a different worker? And if so, why does the Airflow source fail the task in the event of a hostname mismatch?
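For context, I believe the override that PR introduced is exposed as the hostname_callable setting in airflow.cfg (present in Airflow 1.10; the exact key name and default are my assumption here, so treat this as a sketch rather than a confirmed fix):

```ini
[core]
# Callable used to record the worker hostname for a task instance.
# Pointing this at something stable across worker restarts/reschedules
# is the knob that PR appears to add; socket:getfqdn is the default.
hostname_callable = socket:getfqdn
```

I'm hesitant to change it blindly, since on Cloud Composer the workers are Kubernetes pods and their FQDNs should already be stable while the pods stay up, as they do in my case.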