We have a recurring job, JOB_A, which runs every 15 minutes. If it fails, we have to force start another box, BOX_TO_FIX, to fix the issue.
The problem is that our Operations team takes 20-30 minutes to respond to a failure of JOB_A. Before they can start BOX_TO_FIX, JOB_A starts again and fails a second time.
Our concern is that another operator may pick up this second alert and run BOX_TO_FIX a second time, which we have to avoid.
Is it possible to stop JOB_A from being scheduled again after its first failure? Once its status is failed, it should not start again until we have fixed the reason for the failure.
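
To make the behaviour we are after concrete, here is a minimal Python sketch. It is only a model of the logic, not scheduler syntax: the `RecurringJob` class, its `tick` and `mark_fixed` methods, and the `on_hold` flag are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecurringJob:
    """Hypothetical model of JOB_A; not a real scheduler API."""
    name: str
    on_hold: bool = False   # True once a run has failed and has not been fixed yet
    alerts_sent: int = 0    # number of failure alerts raised to Operations

    def tick(self, run_succeeds: bool) -> str:
        """Called at every 15-minute schedule slot."""
        if self.on_hold:
            # A previous run failed: do not run again, do not raise another alert.
            return "SKIPPED"
        if run_succeeds:
            return "SUCCESS"
        # First failure: block further scheduled runs and alert once.
        self.on_hold = True
        self.alerts_sent += 1
        return "FAILURE"

    def mark_fixed(self) -> None:
        """Called after BOX_TO_FIX has completed; scheduling resumes."""
        self.on_hold = False


job = RecurringJob("JOB_A")
# One good run, then the underlying problem persists for three slots.
results = [job.tick(True), job.tick(False), job.tick(False), job.tick(False)]
job.mark_fixed()                 # Operations ran BOX_TO_FIX once
results.append(job.tick(True))
print(results, job.alerts_sent)
```

In this model, only the first failure produces an alert, so a second operator never sees a reason to run BOX_TO_FIX again. We are looking for the equivalent of the `on_hold` behaviour in the scheduler itself.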