How to prevent OOMKilled when running parallel Katib trials with unbalanced resource requirements?

Asked Jul 21 '22 at 07:52

Active Jul 21 '22 at 07:52

I've got very unbalanced (exponentially growing) memory requirements across different Katib trials. The smaller trials are perfectly fine to run 16 in parallel on my 4-node cluster, but the larger ones use so much memory that their pods get OOMKilled by Kubernetes.

Ideally, I would like to control the degree of parallelism based on the chosen hyperparameters, but this doesn't seem to be possible in Katib.

Is there another way to prevent those trial pods from being scheduled in parallel and somehow keep them in "Pending" until the resources are free again? Maybe at the Kubernetes level?
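To make the desired behaviour concrete, here is a minimal sketch of what "keep them pending until resources are free" would look like if each trial pod declared a memory request matching its real footprint. The Kubernetes scheduler only places a pod on a node whose free allocatable resources cover the pod's requests; otherwise the pod stays Pending. Everything numeric below is an assumption made for illustration: the 32 GiB per node, the `depth` hyperparameter, and the doubling memory model are hypothetical, and the first-fit loop is a simplification, not the real scheduler algorithm.

```python
# Hypothetical numbers throughout: 4 nodes with 32 GiB allocatable each,
# and trial memory that doubles with a made-up hyperparameter `depth`.

def trial_memory_gib(depth: int) -> int:
    """Assumed exponential memory model: 1 GiB at depth 0, doubling per step."""
    return 2 ** depth


def schedule(requests_gib, node_free_gib):
    """First-fit placement by memory request; returns (placed, pending).

    Models only one fact: a pod whose request fits on no node stays Pending.
    """
    free = list(node_free_gib)
    placed, pending = [], []
    for req in requests_gib:
        for i, capacity in enumerate(free):
            if req <= capacity:
                free[i] -= req          # reserve the requested memory on node i
                placed.append((req, i))
                break
        else:
            pending.append(req)         # no node has enough free memory
    return placed, pending


nodes = [32, 32, 32, 32]
small = [trial_memory_gib(1)] * 16      # 16 trials requesting 2 GiB each
large = [trial_memory_gib(4)] * 16      # 16 trials requesting 16 GiB each

placed_small, pending_small = schedule(small, nodes)
placed_large, pending_large = schedule(large, nodes)

print(len(placed_small), len(pending_small))  # 16 0
print(len(placed_large), len(pending_large))  # 8 8
```

Under these assumptions all 16 small trials run at once, while only 8 of the large trials are placed and the remaining 8 wait instead of being OOMKilled. Whether the per-trial request can be derived from the hyperparameters inside a Katib trial template is the open part of the question.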

asked Jul 21 '22 at 07:52

Romeo Kienzler

