When submitting Condor jobs, a few of them typically fail for unknown reasons and have to be resubmitted. So I was wondering: what's the most efficient way of resubmitting failed Condor jobs, i.e. without having to fish them out one by one and resubmit them?

I tried to grep for all the failure messages and extract the job IDs, but that is time-consuming to do by hand.

1 Answer

How is the job failing? If it fails with a non-zero exit code, try setting

max_retries = 5

in your condor_submit file. That way, if the job exits with a non-zero exit code, condor will re-run it up to five times until it does exit zero.

Greg
  • Hi Greg! Thanks a lot for your answer. Unfortunately the job does exit zero, but the output is not complete. Is there a way to resubmit a job based on a condition on its output? – StackExchanger Apr 11 '23 at 21:51
  • If you can write a script that detects that the output is bad, you can submit your job as a single-node DAG with `condor_submit_dag`, where the single node has a DAG post script that runs this script and returns non-zero if the output is bad, and tell DAGMan to retry the node in that case. Or you could put your job inside a shell wrapper script on the worker node, run this same script there to detect bad output and return non-zero, and use `max_retries` in the submit file. – Greg Apr 12 '23 at 04:07
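
The wrapper idea from the last comment could look roughly like the sketch below. It is written in Python rather than shell, and everything specific in it is an assumption for illustration: the script name `wrapper.py`, the helper names, and above all the completeness test, which here supposes that a good output file ends with a `DONE` line. Replace that check with whatever actually distinguishes complete from incomplete output for your job.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run the real job, then exit non-zero if its output looks incomplete."""
import subprocess
import sys
from pathlib import Path

# Assumption: a complete output file ends with this marker line.
DONE_MARKER = "DONE"


def output_is_complete(path, marker=DONE_MARKER):
    """Return True if the file exists and its last line equals the marker."""
    p = Path(path)
    if not p.is_file():
        return False
    lines = p.read_text().splitlines()
    return bool(lines) and lines[-1].strip() == marker


def run_and_check(cmd, output_path):
    """Run cmd; pass through a non-zero exit code, else return 1 if the output is bad."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        return result.returncode
    return 0 if output_is_complete(output_path) else 1


# Usage: wrapper.py OUTPUT_FILE COMMAND [ARGS...]
# Does nothing when called without enough arguments.
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(run_and_check(sys.argv[2:], sys.argv[1]))
```

In the submit file, the wrapper would then be the `executable`, with the output file and the real command passed as `arguments`. Because the wrapper exits non-zero on incomplete output, the retry setting from the answer applies to this case too.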