By default, Condor manages jobs under the assumption that the user wants them to be run as many times as necessary in order to successfully finish. If all goes well, this means the job will only run once. However, various failures can require that the job be restarted in order to succeed. Examples of such failures include:
- execute machine crashes while job is running
- submit machine crashes while job is running and does not reconnect to job before the job lease expires
- job is evicted by condor_vacate, PREEMPT, or is preempted by another job
- output files from job fail to be transferred back to submit machine
- input files fail to be transferred
- network failures between execute and submit machine
In some cases, a job should not be restarted: the user wants Condor to try to run the job once and, if that attempt fails for any reason, not make a second attempt. To achieve this, put the following in the job's submit file:
requirements = NumJobStarts == 0
periodic_remove = JobStatus == 1 && NumJobStarts > 0
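For context, the two settings above might sit in a complete submit file roughly as follows. This is only a sketch: the universe, executable, and file names are placeholders and not part of the original recipe. The requirements expression keeps the job from being matched once NumJobStarts is nonzero, and periodic_remove removes the job from the queue if it is idle again (JobStatus 1) after having started.

```
# Hypothetical job: the executable and file names below are placeholders.
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log

# Only match the job to a machine if it has never been started.
requirements = NumJobStarts == 0

# If the job is idle again (JobStatus 1) after having started,
# remove it from the queue instead of running it a second time.
periodic_remove = JobStatus == 1 && NumJobStarts > 0

queue
```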
Note that this does not guarantee that Condor will start the job only once. The NumJobStarts job attribute is updated shortly after the job starts running, and various types of failures (e.g., a network failure between the submit and execute machines) can result in the job starting without this attribute being updated. Setting SHADOW_LAZY_QUEUE_UPDATE=false shortens the window between the job starting and the update of NumJobStarts, but it still does not guarantee that the job will never be started more than once. The resulting policy is therefore to run the job at least once and, on a best-effort basis with no strong guarantee, not more than once.
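As a sketch, the SHADOW_LAZY_QUEUE_UPDATE setting mentioned above could be placed in the Condor configuration of the submit machine. The choice of the local configuration file is an assumption here; any configuration file read on the submit machine should serve.

```
# Submit machine configuration (assumed location: the local config file).
# Update job attributes such as NumJobStarts in the queue promptly
# rather than lazily.
SHADOW_LAZY_QUEUE_UPDATE = false
```

After editing the configuration, running condor_reconfig on the submit machine should make the daemons reread it. The new value will most likely apply only to jobs started afterwards, not to jobs that are already running.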
