
Could you please kindly suggest a solution?

I run a GPU-based workload, and my instances are g4dn.large. I would like to use spot instances, since on-demand costs a lot :-)

But it is quite common for spot GPU instances to be unavailable for long periods. Initially I configured the original AWS Cluster Autoscaler with the priority expander: I had two node groups, one spot and one on-demand, and the autoscaler scaled the spot group first, and only if spot capacity was not available did it scale the on-demand group (a sketch of such a priority configuration is shown below).
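For reference, this is roughly the priority expander setup I mean; it is only a minimal sketch, the node group name patterns are illustrative, and the Cluster Autoscaler has to run with `--expander=priority` for this ConfigMap to be picked up:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # The Cluster Autoscaler looks for exactly this name in kube-system
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # Higher number = higher priority: try node groups matching "spot" first,
    # fall back to the on-demand group only if the spot group cannot be scaled
    10:
      - .*spot.*
    1:
      - .*on-demand.*
```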

But after some time all instances became on-demand, because of the absent spot capacity.

The autoscaler tries to scale up the spot node group, it has no capacity, so it scales on-demand instead, the pod is running, happy time!

But there is no logic to try spot capacity again and to rebalance pods, so that my pod would be rescheduled to a spot instance when one comes up. Yes, I can delete the on-demand node after some time, and if spot capacity can be fulfilled, it will create a spot instance.

I have tried Karpenter, and it seems to do some of this work, but not the way I would like. It is possible to configure node expiration in Karpenter, so that it will, for example, expire a node every 5 minutes. But the expiration logic does not distinguish between spot and on-demand. So if we have a spot instance, it will be expired; and if we have no spot capacity at the moment and get an on-demand node, Karpenter will also expire it in 5 minutes and try to get spot capacity again (a sketch of such a provisioner is shown below).
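This is roughly the provisioner shape I mean, a minimal sketch using the v1alpha5 Provisioner API (newer Karpenter versions express the same idea with a NodePool and `disruption.expireAfter`); the instance types and the AWSNodeTemplate name are placeholders:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  # Expire every node 5 minutes after creation, regardless of capacity type
  ttlSecondsUntilExpired: 300
  requirements:
    # Allow both capacity types so Karpenter can fall back to on-demand
    # when spot is unavailable
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    # Illustrative GPU instance types
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.xlarge", "g4dn.2xlarge"]
  providerRef:
    name: gpu-node-template   # hypothetical AWSNodeTemplate
```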

Can you please suggest how I can achieve a scheme where my EKS cluster has GPU instances, and if there is no spot capacity it creates an on-demand instance, constantly tries to create spot capacity, and when it succeeds, it reschedules the pod to spot and terminates the on-demand instance?

Any help will be extremely appreciated!

dmitrii

1 Answer


A mixed instances policy on the node group should help:

you can modify onDemandBaseCapacity or onDemandPercentageAboveBaseCapacity based on your needs for the failover to an on-demand node.
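As a sketch of what that could look like, here is a minimal eksctl node group using instancesDistribution; the cluster name, region, and instance types are only placeholders:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster      # placeholder
  region: us-east-1     # placeholder
nodeGroups:
  - name: gpu-mixed
    minSize: 1
    maxSize: 10
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]   # illustrative GPU types
      # 0 on-demand base and 0% on-demand above base = all spot;
      # raise either value to guarantee some on-demand capacity as a fallback
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
```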

kholisrag