HTCondorWiki: Dag Submit Retries

Submit retries are entirely separate from node retries. Submit retries handle the case where the condor_submit command for a node job fails (because the schedd is too heavily loaded and times out, for example); in other words, the node job never even gets into the HTCondor queue. Node retries, on the other hand, handle the case where the node as a whole fails (the HTCondor job runs but fails, for example).

If a node job submit fails, the submit is retried without this being considered a node failure. By default, a submit will be attempted up to 6 times, with an exponential backoff on the time between attempts, so as not to overload the schedd. If all of the submit attempts fail, that is considered a single node failure. So if, for example, a node had 3 retries set, the node could be tried up to 4 times (the initial try plus 3 retries), and up to 4 x 6 = 24 attempts to submit the node job could be made.
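The arithmetic behind the 24 figure can be sketched in a few lines of Python. This is only an illustration, not part of DAGMan: the function name and parameters are hypothetical, and it reads the default of 6 as the number of submit attempts allowed per try of the node, which is what the 24 figure implies.

```python
def max_submit_attempts(node_retries: int, submit_attempts_per_try: int = 6) -> int:
    """Upper bound on condor_submit attempts for a single node.

    Hypothetical helper for illustration only; not an HTCondor API.
    node_retries: number of node-level retries set for the node.
    submit_attempts_per_try: submit attempts allowed each time the node
        is tried (assumed default of 6, matching the text above).
    """
    if node_retries < 0:
        raise ValueError("node_retries must be >= 0")
    if submit_attempts_per_try < 1:
        raise ValueError("submit_attempts_per_try must be >= 1")
    # A node with N retries can be tried N + 1 times in total.
    node_tries = node_retries + 1
    # Each try of the node may use up its full allowance of submit attempts.
    return node_tries * submit_attempts_per_try


if __name__ == "__main__":
    print(max_submit_attempts(3))  # 4 node tries x 6 submit attempts = 24
```

This is a worst-case bound: it is only reached if every submit attempt fails on every try of the node.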

The number of submit attempts can be changed globally for a DAG (through the DAGMAN_MAX_SUBMIT_ATTEMPTS configuration setting), but it cannot be set individually for a single node the way node-level retries can.