# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine. After
# three runs on three different machines, the job will
# stay on hold.
#
executable = foo.exe
expected_runtime_minutes = 20
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
   target.machine =!= MachineAttrMachine2 && \
   target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
   time() - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue

Note that the technique is to put the job on hold via periodic_hold if it runs too long, which moves the job to the Held state. The job is then released via periodic_release, which returns it to Idle so that it is rescheduled. The requirements expression ensures the job runs on a different machine entirely, not just a different slot on the same machine; see AvoidingBlackHoles.
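
To make the two expressions easier to reason about, here is a minimal Python sketch of what periodic_hold and requirements test. It is a toy model, not an HTCondor API: the function names, the host names, and the use of None for an undefined MachineAttrMachineN attribute are all assumptions made for illustration.

```python
# Toy model of the submit file's hold and placement expressions.
# Plain Python, not an HTCondor API; all names here are hypothetical.

RUNNING = 2  # the JobStatus value the submit file tests for


def should_hold(job_status, now, entered_current_status, expected_runtime_minutes):
    """Mirror of periodic_hold: the job is running and has exceeded its limit."""
    elapsed_seconds = now - entered_current_status
    return job_status == RUNNING and elapsed_seconds > 60 * expected_runtime_minutes


def machine_allowed(candidate, previous_machines):
    """Mirror of requirements: the candidate differs from every recorded machine.

    None stands in for an undefined MachineAttrMachineN attribute. The `=!=`
    operator treats an undefined value as "not identical", so an empty
    history slot never blocks a match.
    """
    return all(candidate != prev for prev in previous_machines)


# A job that has been running for 21 minutes against a 20 minute limit.
print(should_hold(RUNNING, now=21 * 60, entered_current_status=0,
                  expected_runtime_minutes=20))

# One machine has been used so far; the other history slots are undefined.
history = ["node1.example.org", None, None]
print(machine_allowed("node1.example.org", history))
print(machine_allowed("node2.example.org", history))
```

The comparison is made on the machine name rather than the slot name, which is why a second slot on the same host is rejected as well.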
Also note that the periodic_release expression only releases a job that was put on hold for a known cause, which we implement with the periodic_hold_subcode attribute. After all, we don't want to release a job that was put on hold for a different reason, such as the user running condor_hold. We also set periodic_hold_reason to something helpful, so that condor_q -hold displays something informative.
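
The release test can be sketched the same way. The following Python is only an illustration of the logic in periodic_release and periodic_hold_reason; the function names are hypothetical, and the hold code used for the manual-hold case is an assumed stand-in for "any code other than 3".

```python
# Toy model of periodic_release and periodic_hold_reason.
# Plain Python, not an HTCondor API; all names here are hypothetical.

PERIODIC_HOLD_CODE = 3  # the HoldReasonCode the submit file tests for
TOO_LONG_SUBCODE = 1    # the value given to periodic_hold_subcode
MAX_RUNS = 3            # the bound in "JobRunCount < 3"


def should_release(hold_reason_code, hold_reason_subcode, job_run_count):
    """Release only holds that this submit file itself caused."""
    return (hold_reason_code == PERIODIC_HOLD_CODE
            and hold_reason_subcode == TOO_LONG_SUBCODE
            and job_run_count < MAX_RUNS)


def hold_reason(job_run_count):
    """Mirror of the ifthenelse() in periodic_hold_reason."""
    return "Ran too long, will retry" if job_run_count < MAX_RUNS else "Ran too long"


# Held by our periodic_hold after the first run: released.
print(should_release(PERIODIC_HOLD_CODE, TOO_LONG_SUBCODE, job_run_count=1))

# Held for some other reason (assumed code 1, subcode 0): left on hold.
print(should_release(1, 0, job_run_count=1))

# Held by our periodic_hold after the third run: left on hold.
print(should_release(PERIODIC_HOLD_CODE, TOO_LONG_SUBCODE, job_run_count=3))
print(hold_reason(3))
```

Checking the subcode as well as the code is what separates this policy's holds from any other periodic hold policy that might also produce code 3.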
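Finally, note that we limit the number of times a job goes through the hold/release cycle: periodic_release requires JobRunCount < 3, so a job that keeps running too long is started at most three times.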
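
As a rough check of that bound, the sketch below walks a job that always overruns through the cycle and records where it ran. It is a simplified simulation under stated assumptions: the machine names and the simulate function are hypothetical, matching is reduced to "first unused machine", and the full visit history is remembered rather than only the last few MachineAttrMachineN values.

```python
# Toy simulation of the hold/release cycle for a job that always runs too long.
# Plain Python, not an HTCondor API; all names here are hypothetical.


def simulate(max_runs=3, machines=("m1", "m2", "m3", "m4")):
    """Return the machines visited before the job stays on hold (or idle)."""
    visited = []
    job_run_count = 0
    while True:
        # Matching: the requirements expression rules out machines used before.
        candidate = next((m for m in machines if m not in visited), None)
        if candidate is None:
            # No acceptable machine is left, so the job would simply stay idle.
            return visited
        visited.append(candidate)
        job_run_count += 1
        # The job overruns and periodic_hold puts it on hold.
        # periodic_release fires only while JobRunCount < max_runs.
        if not job_run_count < max_runs:
            return visited


print(simulate())
print(simulate(machines=("m1", "m2")))
```

With the default bound the job is started three times, each time on a different machine, and then remains on hold. In a pool with fewer eligible machines than allowed runs, the job would instead wait idle for a machine it has not yet used.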
All of the mechanisms used in this submit file are described on the condor_submit manual page. It may also be useful to browse the documented job ClassAd attributes.