Airflow fault tolerance

Question

I have 2 questions:

First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one of the worker nodes goes down?
Second, is it possible for the whole Airflow server to go down? If yes, is there a backup that starts automatically to continue the work?

Note: I have started learning Airflow recently. Thanks in advance.

This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation, but I could not find how fault tolerance is handled.

score 0 · Answer 1 · answered Nov 17 '22 at 23:03

what does it mean that the Kubernetes executor is fault tolerant?

The Airflow scheduler uses a Kubernetes API watcher to follow every state change of the worker pods (one pod per task), which is how it discovers failed pods. When a worker pod goes down, the scheduler detects the failure and updates the state of the affected tasks in the metadata database. Those tasks can then be rescheduled and executed again according to their retry configuration.
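To make the retry part concrete, here is a minimal sketch. `retries` and `retry_delay` are standard Airflow task arguments that you would normally pass through `default_args`; the function `state_after_pod_failure` is a hypothetical toy model written for this answer, not Airflow's actual code, and the values are only examples.

```python
from datetime import timedelta

# Retry settings as they would be passed to a DAG via `default_args`.
# The values (2 retries, 5 minutes) are illustrative assumptions.
default_args = {
    "retries": 2,                         # extra attempts after the first failure
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}


def state_after_pod_failure(attempts_so_far: int, retries: int) -> str:
    """Toy model (not Airflow code) of the state given to a task whose pod died.

    `attempts_so_far` counts the attempts already made, including the failed one.
    """
    if attempts_so_far <= retries:
        return "up_for_retry"  # the task is queued again after retry_delay
    return "failed"            # retries are exhausted


# With retries=2 a task can run 3 times in total.
for attempt in (1, 2, 3):
    print(attempt, state_after_pod_failure(attempt, default_args["retries"]))
```

So a lost worker pod only leads to a permanently failed task once the configured retries are used up. With `retries` left at 0, the first pod failure already marks the task as failed.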

is it possible that the whole Airflow server goes down?

Yes, it is possible, for several reasons, and there are different solutions/tips for each one:

Problem with the metadata database: this is the most important part of Airflow. It is the central point of communication between the schedulers and the workers, and it stores the state of all DAG runs and tasks, the messages shared between tasks (XComs), and the variables and connections. When it goes down, everything fails:
- You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
- You can deploy it on your K8S cluster, but in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a chance of losing it, depending on how you deploy it:
- Request enough resources (especially memory) to avoid OOM kills
- Avoid running it on spot/preemptible VMs
- Create multiple scheduler replicas (supported since Airflow 2.0; 3 is a sensible number) to run it in HA mode, so that if one scheduler goes down the others keep working
Problem with the webserver pod: this doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
- Request enough resources (especially memory) to avoid OOM kills
- It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can still reach the UI/API through the other replicas
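If you want to detect these failures yourself, the Airflow 2.x webserver exposes a `/health` endpoint that reports the status of the metadata database and the scheduler. Below is a rough sketch of a check built on it. The helper names `unhealthy_components` and `fetch_health` and the base URL are my own assumptions, and the sample payload only mimics the shape of the real response.

```python
import json
from urllib.request import urlopen


def unhealthy_components(health: dict) -> list[str]:
    """Return the names of components whose status is not "healthy"."""
    return sorted(
        name
        for name, info in health.items()
        if isinstance(info, dict) and info.get("status") != "healthy"
    )


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the health JSON from a webserver; `base_url` depends on your setup."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


# Sample payload shaped like the /health response, used here instead of a live call.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
}
print(unhealthy_components(sample))  # ['scheduler']
```

Against a real deployment you would call something like `unhealthy_components(fetch_health("http://localhost:8080"))`, where the URL is a placeholder. Note that this check goes through the webserver, so it cannot tell you anything while the webserver itself is down, and a component that is not deployed may also show up as not healthy.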
