
I am currently planning to upgrade our Cloud Composer environment from Composer 1 to Composer 2. However, I am quite concerned about disruptions that could occur in Cloud Composer 2 due to the new autoscaling behavior inherited from GKE Autopilot. In particular, since nodes will now autoscale based on demand, it seems like nodes with running workers could be killed off if GKE thinks the workers could be rescheduled elsewhere. This would be bad because my code isn't currently very tolerant of retries.

I think that this can be prevented by adding the following annotation to the worker pods: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
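If I were creating the pods myself, I would just set that key in the pod metadata. For illustration only, here is a minimal sketch using the official kubernetes Python client; all of the names below are placeholders, not what Composer actually uses:

    from kubernetes import client

    # Placeholder pod definition. Composer creates the real worker pods itself,
    # which is exactly why I can't do this directly; this just shows where the
    # annotation would live if I could.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="airflow-worker-example",    # placeholder name
            namespace="example-namespace",    # placeholder namespace
            annotations={
                # Ask the cluster autoscaler not to evict this pod on scale-down.
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
            },
        ),
        spec=client.V1PodSpec(
            containers=[
                client.V1Container(name="airflow-worker", image="example-image"),
            ],
        ),
    )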

However, I don't know how to add annotations to worker pods created by Composer (I'm not creating them myself, after all). How can I do that?

EDIT: I think this issue is made more complex by the fact that it should still be possible for the cluster to evict a pod once it's finished processing all its Airflow tasks. If the annotation is added but doesn't go away once the pod is finished processing, I'm worried that could prevent the cluster from ever scaling down.

So a more dynamic solution may be needed, perhaps one that takes into account the actual tasks that Airflow is processing.
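To make the idea concrete, here is the kind of thing I have in mind, sketched with the official kubernetes client and Airflow's own ORM. This is only a sketch: the namespace and label selector are placeholders (I don't know what Composer actually uses), and it assumes that a task instance's recorded hostname matches the worker pod's name, which I believe is true for Kubernetes-based workers but haven't verified for Composer:

    from airflow import settings
    from airflow.models import TaskInstance
    from airflow.utils.state import State
    from kubernetes import client, config

    # Placeholders: I don't know the namespace or labels Composer actually
    # gives its worker pods.
    NAMESPACE = "example-composer-namespace"
    WORKER_SELECTOR = "component=worker"
    SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


    def has_running_tasks(pod_name: str) -> bool:
        """Return True if any Airflow task instance is currently running on this worker.

        Assumes TaskInstance.hostname matches the worker pod name, which I have
        not verified for Composer.
        """
        session = settings.Session()
        try:
            running = (
                session.query(TaskInstance)
                .filter(
                    TaskInstance.state == State.RUNNING,
                    TaskInstance.hostname == pod_name,
                )
                .count()
            )
            return running > 0
        finally:
            session.close()


    def reconcile_annotations() -> None:
        """Mark busy workers as not safe to evict and idle workers as safe."""
        config.load_incluster_config()  # or config.load_kube_config() if run outside the cluster
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(NAMESPACE, label_selector=WORKER_SELECTOR)
        for pod in pods.items:
            desired = "false" if has_running_tasks(pod.metadata.name) else "true"
            current = (pod.metadata.annotations or {}).get(SAFE_TO_EVICT)
            if current != desired:
                v1.patch_namespaced_pod(
                    name=pod.metadata.name,
                    namespace=NAMESPACE,
                    body={"metadata": {"annotations": {SAFE_TO_EVICT: desired}}},
                )


    if __name__ == "__main__":
        reconcile_annotations()

I'm also not sure where such a script would best run in Composer; running it as a DAG feels a bit circular, since it would occupy a worker itself.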

Stephen
  • I don't think this is possible in Cloud Composer, which doesn't let users configure the Airflow service pods. I'd suggest creating a support ticket with GCP to get a confirmed answer, and to ask them to support this feature if it isn't supported. – Hussein Awala Jan 21 '23 at 22:51
  • Are you worried about your DAGs or are you actually deploying other apps into the Composer environment? – Gari Singh Jan 26 '23 at 13:26
  • @GariSingh I'm worried about my DAGs. – Stephen Jan 26 '23 at 14:12
  • Putting on my theoretical hat, it seems like the core issue is that Kubernetes needs different "drain" behavior for jobs than for services. It already drains services well: as I understand it, it first stops sending traffic to them and then kills them by default 60 seconds or so later, by which point the server should have finished most of its processing. But jobs execute for a lot longer. Also, just as importantly, a job will usually lose progress if forced to restart, unlike services. (And some jobs aren't perfectly idempotent, although this is arguably an antipattern.) – Stephen Jan 26 '23 at 22:25

1 Answer


If I have understood your problem correctly, could you please try this solution:

  1. In the Cloud Composer environment, navigate to the Kubernetes Engine > Workloads page in the GCP Console.
  2. Find the worker pod you want to annotate and click on its name.
  3. On the pod details page, click the Edit button.
  4. In the Pod template section, find the Annotations field and click the pencil icon to edit.
  5. In the Edit annotations field, add the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false".
  6. Click the Save button to apply the change.
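
If you prefer to script it instead of clicking through the Console, the same per-pod change can be made with the Kubernetes Python client. This is just a sketch; the namespace and pod name are placeholders you would replace with your environment's actual values:

    from kubernetes import client, config

    # Placeholders: substitute your environment's actual namespace and pod name.
    NAMESPACE = "example-composer-namespace"
    POD_NAME = "airflow-worker-example"

    config.load_kube_config()  # uses your local kubeconfig credentials
    v1 = client.CoreV1Api()

    # Merge-patch only the annotation onto the existing pod.
    v1.patch_namespaced_pod(
        name=POD_NAME,
        namespace=NAMESPACE,
        body={
            "metadata": {
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
                }
            }
        },
    )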

Let me know if it works for you. Good luck.

Muhammad Asadullah
  • If I'm doing that per pod, then this probably won't work, unfortunately, because Composer deletes and creates new pods on the fly, and those wouldn't have any annotations I applied to previous pods. Right? – Stephen Jan 26 '23 at 20:42
  • Hi Stephen! Yes, that's correct. If you are adding the annotation on a per-pod basis through the GCP Console, the annotation will be lost when the pod is deleted and recreated by Composer. Regards, Asad. – Muhammad Asadullah Jan 26 '23 at 20:50
  • What about modifying the "pod template" used by the environment? That should make this easier: edit the environment's configuration and add the annotation to the pod_template section. Hope it helps. – Muhammad Asadullah Jan 26 '23 at 20:52
  • Well, the thing is that I don't want to prevent ALL evictions of pods; I only want to prevent evictions of pods that are currently running Airflow tasks. Once a pod (which is running the Airflow worker program) has finished its tasks, it should be considered safe to evict; otherwise the cluster will never scale down. So I assume a template will not help, because if I understand the issue correctly, I'd want these annotations to be dynamic. – Stephen Jan 26 '23 at 22:10