
Suppose I have a 3-node EKS cluster made up of 3 spot instances (we'll call them Node A, B, and C), and each node has critical pods scheduled. The cluster runs the AWS Node Termination Handler. A spot interruption notice is posted to the instance metadata saying that Node A will be reclaimed by Amazon in 2 minutes.

The Node Termination Handler cordons and drains the node being taken (Node A), and a new node spins up. The pods from Node A are then scheduled on the replacement node. If this completes within the two-minute window, perfect.
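
For context, the drain goes through the eviction API, which respects PodDisruptionBudgets. A minimal sketch (the `critical-api` name and label are hypothetical) of a PDB that keeps at least one replica of a critical workload running while Node A is drained:

```yaml
# Keeps at least 1 replica of the (hypothetical) critical-api pods
# available during a voluntary disruption such as a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-api-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: critical-api
```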

Is there a benefit to having spare capacity around (Node D)? If Node A is taken back by Amazon, will my pods be rescheduled on Node D since it is already available?

In this architecture, it seems like a great idea to have a spare node or two around for pod rescheduling so I don't run the risk of missing the 2-minute window. Do I need to do anything special to make sure the pods are rescheduled in the most efficient way?

1 Answer

Is there a benefit to having spare capacity around (Node D)? If Node A is taken back by Amazon, will my pods be rescheduled on Node D since it is already available?

Yes, definitely. There is a good chance the pods will get scheduled on that node, as long as the deployment has no constraints that keep them off it, such as a node selector, taints/tolerations, or affinity rules. A sketch of that is shown below.
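
For illustration, a sketch (names, labels, and taint are hypothetical) of the kind of constraints that would stop these pods from landing on Node D unless it carries matching labels:

```yaml
# Example Deployment. The nodeSelector and tolerations below are the kind
# of constraints that decide whether these pods may land on a spare node
# like Node D; affinity rules work the same way. If Node D lacks the
# workload-tier=critical label, the pods stay Pending instead.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      nodeSelector:
        workload-tier: critical       # hypothetical node label
      tolerations:
        - key: dedicated              # hypothetical taint on the nodes
          operator: Equal
          value: critical
          effect: NoSchedule
      containers:
        - name: critical-api
          image: nginx:1.25           # placeholder image
```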

Do I need to do anything special to make sure the pods are rescheduled in the most efficient way?

That sounds like a good idea, but what if all 3 nodes get the termination signal at the same time? Can all the pods be rescheduled onto new nodes within 2 minutes?

Will 3 new nodes be available, or only the single Node D?
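
One common way to guarantee that headroom actually exists (rather than hoping a single Node D is enough) is to run low-priority placeholder pods that reserve capacity and get preempted the moment real pods need the room. A sketch, with hypothetical names and sizes:

```yaml
# Negative-priority class so placeholder pods are always evicted first.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that reserve spare capacity."
---
# Placeholder Deployment: requests roughly one node's worth of resources
# (sizes are hypothetical) so the scheduler keeps that capacity free.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```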

You need to take care of the total size (resource requests) of all the pods relative to the number of nodes, and configure readiness and liveness probes with fast settings so the pods come up as soon as possible and can handle traffic.
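
As a sketch (the image, endpoint, port, and timings are hypothetical), a pod-spec fragment with probes tuned so a rescheduled pod starts receiving traffic quickly:

```yaml
# Container fragment: a fast readiness probe so the pod is put behind the
# Service soon after rescheduling, and a liveness probe that is not so
# aggressive that it kills a pod while it is still starting up.
containers:
  - name: critical-api
    image: nginx:1.25                # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz               # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 2
      periodSeconds: 2
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
```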

If only your single Node D is running and all 3 spot instances get terminated, that can create an issue. What about the pods of the NGINX ingress controller or the service mesh you will be running?
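
When capacity gets that tight, one option is to give the ingress / service-mesh pods a high PriorityClass so the scheduler preempts less important workloads to fit them on the remaining node. A minimal sketch (name and value are arbitrary):

```yaml
# High-priority class for ingress / service-mesh pods; their Deployments
# reference it via priorityClassName: ingress-critical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ingress-critical
value: 1000000
globalDefault: false
description: "High priority for ingress and service-mesh pods."
```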

While the NGINX pods are being rescheduled, they may sometimes take a few seconds to come up; if they are replaced gradually (RollingUpdate style, rather than all at once), that's fine.
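
A sketch (replica count, labels, and image are placeholders) of an ingress Deployment that is both spread across nodes and rolled one pod at a time, so losing a single node never takes out every replica at once:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # replace at most one pod at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      priorityClassName: ingress-critical   # from the PriorityClass sketch above
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread replicas across nodes
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: ingress-nginx
      containers:
        - name: controller
          image: nginx:1.25      # placeholder; the real ingress controller image goes here
```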
