
I am using AWS Batch to execute jobs, and I calculate each job's initial memory from the input content size. About 90% of the time this succeeds, but the remaining 10% of jobs fail with an OutOfMemory error.
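(For illustration, a minimal sketch of such a size-based heuristic; the headroom factor and floor are hypothetical values, not from the question:)

```python
# Hypothetical sizing heuristic: request memory proportional to the
# input size, with some headroom and a minimum floor. The factor and
# floor values here are placeholders, not the asker's actual numbers.
def estimate_memory_mib(content_size_bytes: int,
                        factor: float = 2.0,
                        floor_mib: int = 512) -> int:
    """Return a MEMORY request in MiB derived from the content size."""
    needed = int(content_size_bytes / (1024 * 1024) * factor)
    return max(needed, floor_mib)
```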

So on the next attempt for these failed jobs, I would like to increase the memory and submit the job again. I cannot use AWS Batch Job Attempts for this, so I need a different failover strategy.

One option is a Lambda function that checks the job status every hour and, if the job has failed, submits it again with additional memory, roughly as sketched below.
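(A minimal sketch of that Lambda in Python with boto3. The event shape, the 1.5x growth factor, the retry job name suffix, and the fallback default memory are all assumptions for illustration; `describe_jobs`, `submit_job`, and `containerOverrides.resourceRequirements` are real AWS Batch APIs:)

```python
import boto3

batch = boto3.client("batch")

GROWTH_FACTOR = 1.5  # assumption: grow the memory request by 50% per retry


def handler(event, context):
    """Check one Batch job; if it FAILED, resubmit it with more memory."""
    job_id = event["jobId"]  # hypothetical input shape for this sketch
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]

    if job["status"] != "FAILED":
        return {"action": "none", "status": job["status"]}

    # Read the current MEMORY requirement from the container detail,
    # falling back to the legacy 'memory' field (1024 is a placeholder).
    reqs = job["container"].get("resourceRequirements", [])
    current_mib = next(
        (int(r["value"]) for r in reqs if r["type"] == "MEMORY"),
        job["container"].get("memory", 1024),
    )
    new_mib = int(current_mib * GROWTH_FACTOR)

    # Resubmit the same job definition, overriding only the memory.
    resp = batch.submit_job(
        jobName=job["jobName"] + "-retry",
        jobQueue=job["jobQueue"],
        jobDefinition=job["jobDefinition"],
        containerOverrides={
            "resourceRequirements": [{"type": "MEMORY", "value": str(new_mib)}]
        },
    )
    return {"action": "resubmitted",
            "newJobId": resp["jobId"],
            "memoryMiB": new_mib}
```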

Are there any better ways to implement a failover strategy for AWS Batch jobs?


1 Answer


Good question; I don't know of any scheduler (LSF, SLURM, AWS Batch) that supports this, as IMHO it's not really something a scheduler should do. It belongs in the engine that executes your workflow (think Nextflow / eHive).

You can monitor your container's memory usage with CloudWatch Container Insights; see

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html
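(As a sketch of what that gives you: Container Insights publishes task-level memory metrics such as `MemoryUtilized` in the `ECS/ContainerInsights` namespace, per the link above, which you can query with boto3. The cluster name below is a placeholder; substitute the ECS cluster that AWS Batch manages for your compute environment:)

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# "my-batch-cluster" is a placeholder: Batch runs jobs on an ECS cluster
# it creates for your compute environment; use that cluster's name here.
stats = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="MemoryUtilized",
    Dimensions=[{"Name": "ClusterName", "Value": "my-batch-cluster"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,               # 5-minute buckets
    Statistics=["Maximum"],   # peak memory used per bucket, in MB
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "MB")
```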

Hope this helps you out.
