# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine. After
# three restarts on three different machines, the job will
# stay on hold.
#
executable = foo.exe
expected_runtime_minutes = 20
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
   target.machine =!= MachineAttrMachine2 && \
   target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
   time() - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue

The technique is to put the job on hold via
periodic_hold if it runs too long, which moves the job to the Held state. Next, the job is released via
periodic_release, causing the job to go back to Idle and be rescheduled. The
requirements expression ensures the job runs on a different machine entirely, not just a different slot on the same machine; see AvoidingBlackHoles.
Also note that the
periodic_release expression only releases a job that was put on hold for a known cause, which we implement by using the
periodic_hold_subcode attribute. After all, we don't want to release a job that was put on hold for a different reason, such as the user running
condor_hold. We also set
periodic_hold_reason to something helpful, so typing
condor_q -hold displays something informative.
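Finally, note that we limit the number of times a job goes through the hold/release cycle: the JobRunCount < 3 clause in periodic_release stops releasing the job once it has been started three times, after which it stays on hold.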
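
To make the cycle easier to follow, here is a minimal Python sketch of the policy as a toy state machine. It is not HTCondor code and does not use any HTCondor API. The names Job, minutes_on, machines_tried, should_release and simulate are hypothetical, the per-machine runtimes are made up for illustration, and the requirements expression is simplified to "never match a machine that was already tried".

```python
from dataclasses import dataclass, field

# HoldReasonCode 3 is the value the submit file's periodic_release tests for;
# subcode 1 is the value chosen there via periodic_hold_subcode.
POLICY_HOLD_CODE = 3
RAN_TOO_LONG_SUBCODE = 1


@dataclass
class Job:
    minutes_on: dict                     # hypothetical runtime (minutes) per machine
    expected_runtime_minutes: int = 20
    max_runs: int = 3                    # mirrors the JobRunCount < 3 clause
    status: str = "Idle"
    job_run_count: int = 0
    hold_reason_code: int = 0
    hold_reason_subcode: int = 0
    machines_tried: list = field(default_factory=list)


def should_release(job):
    """Toy version of the periodic_release expression."""
    return (job.hold_reason_code == POLICY_HOLD_CODE
            and job.hold_reason_subcode == RAN_TOO_LONG_SUBCODE
            and job.job_run_count < job.max_runs)


def simulate(job, machines):
    """Run the hold/release cycle until the job completes or gets stuck."""
    while True:
        # Simplified requirements: never match a machine already tried.
        candidates = [m for m in machines if m not in job.machines_tried]
        if not candidates:
            job.status = "Idle"          # nothing left to match against
            return job
        machine = candidates[0]
        job.machines_tried.append(machine)
        job.job_run_count += 1
        job.status = "Running"

        if job.minutes_on[machine] <= job.expected_runtime_minutes:
            job.status = "Completed"
            return job

        # periodic_hold fired: the job ran longer than expected.
        job.status = "Held"
        job.hold_reason_code = POLICY_HOLD_CODE
        job.hold_reason_subcode = RAN_TOO_LONG_SUBCODE

        if not should_release(job):
            return job                   # stays on hold
        job.status = "Idle"              # released, will be rescheduled


if __name__ == "__main__":
    slow = Job(minutes_on={"m1": 30, "m2": 45, "m3": 25, "m4": 5})
    simulate(slow, ["m1", "m2", "m3", "m4"])
    print(slow.status, slow.job_run_count, slow.machines_tried)
```

In this sketch the job overruns on m1, m2 and m3, is released after the first two holds, and stays Held after the third run, so it never reaches m4. A job held with any other code or subcode fails should_release and is left alone, which is the point of checking HoldReasonSubCode in the submit file.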