
We have a recurring job, JOB_A, that runs every 15 minutes. If it fails, we have to force-start another box, BOX_TO_FIX, to fix the issue.

The problem is that our Operations team takes 20-30 minutes to respond to a JOB_A failure. Before they can start BOX_TO_FIX, the recurring JOB_A starts again and fails a second time.

Our concern is that another operator may pick up this second alert and run BOX_TO_FIX a second time, which we need to avoid.

Is it possible to stop the recurring job JOB_A from being scheduled after its first failure? Once its status is FAILURE, it should not start again until we have fixed the cause of the failure.

Raghav
Depending on what JOB_A is and how it fails, you might be able to have that job put itself on ice when it fails, using the sendevent command `sendevent -E JOB_ON_ICE -J "JOB_A"` – HBruijn Aug 22 '16 at 18:25

1 Answer


This sounds like two workflow issues:

  1. Running BOX_TO_FIX when JOB_A fails.
  2. Not allowing JOB_A to run when it has failed, until BOX_TO_FIX can run.

Is it feasible to put a failure(JOB_A) condition on BOX_TO_FIX so that it starts automatically when JOB_A fails?
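If that is feasible, the change could look roughly like the JIL fragment below. It is only a sketch: it assumes BOX_TO_FIX is already defined and has no other starting condition, because `update_job` with a new `condition` would replace whatever is there now.

```
/* Sketch only: assumes BOX_TO_FIX already exists and has no other
   starting condition that this would overwrite. */
update_job: BOX_TO_FIX
condition: failure(JOB_A)
```

With something like this in place, operators would not need to force-start the box by hand, which removes the window in which two people can react to two alerts.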

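Regardless of the answer to that, you can use a global variable that blocks JOB_A after it fails and stays set until BOX_TO_FIX succeeds and resets it. Note that JOB_A_IS_BROKEN probably needs to be set to 0 once before JOB_A is next scheduled, since a condition on an undefined global variable is not expected to evaluate as true.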

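    /* JOB_A only runs while the flag is 0 */
    insert_job: JOB_A
    condition: value(JOB_A_IS_BROKEN) = 0
    /* ...plus the rest of JOB_A's existing attributes */

    /* Sets the flag as soon as JOB_A fails */
    insert_job: OMG_A_BROKE
    condition: failure(JOB_A)
    command: sendevent -E SET_GLOBAL -G "JOB_A_IS_BROKEN=1"

    /* Runs inside the box and clears the flag once the fix has succeeded.
       "last cmd in BOX_TO_FIX" is a placeholder for the name of the
       last command job in the box. */
    insert_job: BOX_TO_FIX_IS_FINISHED
    box_name: BOX_TO_FIX
    condition: success(last cmd in BOX_TO_FIX)
    command: sendevent -E SET_GLOBAL -G "JOB_A_IS_BROKEN=0"
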
Erick B