A common problem is an execute machine that is misconfigured or broken in such a way that it still accepts HTCondor jobs but cannot run them correctly. If jobs exit quickly on such a machine, it can rapidly consume many of the jobs in your queue. We call this a "black hole" machine. Some conditions observed to cause black holes are documented in this wiki.
To work around black hole machines, you can do the following in your job submit file:
match_list_length = 5
This tells HTCondor to save the last five machine names in your job ad in the following attributes:
LastMatchName0 = "current-machine"
LastMatchName1 = "next-most-recent-Name"
LastMatchName2 = "next-next-most-recent-Machine"
...
You can then tell HTCondor not to retry a requeued job on a recently used machine. Note that the expression starts with LastMatchName1, not LastMatchName0, which is the current machine:
Requirements = target.name =!= LastMatchName1 && target.name =!= LastMatchName2 ...
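For illustration, the expression above could be written out in full as follows. This is only a sketch: it assumes that with match_list_length = 5 the attributes run from LastMatchName0 through LastMatchName4, and that the elided clauses follow the same pattern as the first two.

```
# Keep the names of the last five matches in the job ad.
match_list_length = 5

# Skip LastMatchName0 (the current machine) and exclude the older matches.
# Assumption: indices 3 and 4 follow the same pattern as 1 and 2.
Requirements = target.name =!= LastMatchName1 && \
               target.name =!= LastMatchName2 && \
               target.name =!= LastMatchName3 && \
               target.name =!= LastMatchName4
```

Because =!= compares without ever evaluating to undefined, a clause whose LastMatchName attribute has not been set yet does not prevent the job from matching.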
One problem with this solution is that the LastMatchName attributes contain the slot number, whereas a black hole machine usually has problems across all of its slots. The following recipe works for HTCondor 7.6 and later and avoids the slot problem.
Add the following lines to your submit file:
job_machine_attrs = Machine
job_machine_attrs_history_length = 5
requirements = target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2
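To show where these lines sit, here is a minimal sketch of a complete submit file. The executable name my_job.sh and the output, error, and log file names are hypothetical placeholders, not part of the recipe; only the three job_machine_attrs and requirements lines come from the text above.

```
# Hypothetical job: my_job.sh and the file names below are placeholders.
universe   = vanilla
executable = my_job.sh
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.log

# Record the Machine attribute of recent matches in the job ad
# as MachineAttrMachine0, MachineAttrMachine1, and so on.
job_machine_attrs = Machine
job_machine_attrs_history_length = 5

# Avoid machines the job recently ran on, regardless of slot.
requirements = target.machine =!= MachineAttrMachine1 && \
               target.machine =!= MachineAttrMachine2

queue
```

Since Machine holds the host name without a slot prefix, one bad slot is enough to steer the job away from every slot on that host.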
You may also find the HowToAutoRetryElsewhere recipe of interest.