HTCondorWiki: Dag Recovery

Recovery mode is a mode in which DAGMan "catches up" with what has already happened to the node jobs of its DAG before it actually submits any new jobs. We get into recovery mode if a DAG was running, DAGMan exited in a way that didn't create a rescue DAG, and DAGMan was then restarted. There are two main ways that this can happen:

condor_hold/condor_release of the condor_dagman job.
DAGMan exits without creating a rescue DAG because of something like a failed assertion.

You can also force DAGMan into recovery mode by creating an appropriate lock file and running condor_submit on the .condor.sub file (instead of running condor_submit_dag on the DAG file).

We figure out whether we're going to go into recovery mode in main_init() in dagman_main.cpp. After parsing the DAG file(s) and any rescue DAG, we check for the existence of a lock file (foo.dag.lock if the primary DAG file is foo.dag). If the lock file exists, we call util_check_lock_file() in dagman_util.cpp. This attempts to instantiate a ProcessId object from the lock file (this is Joe Meehean's unique PID thing, which tries to avoid the possibility of having a PID refer to the wrong process). If the process that wrote the lock file is still running, we exit (we don't want to have two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file isn't running, or we weren't able to construct the ProcessId object, we continue. In that case, we call util_create_lock_file() to write a lock file containing a serialized ProcessId object corresponding to our process.
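The lock-file decision described above can be sketched roughly as follows. This is an illustration in Python, not the C++ in dagman_main.cpp or dagman_util.cpp: the names decide_startup, read_lock_owner, pid_is_running, and Startup are hypothetical, the lock file here holds only a bare PID rather than a serialized ProcessId object, and the sketch assumes a fresh lock file is written whenever startup continues.

```python
import os
from enum import Enum
from pathlib import Path
from typing import Callable, Optional


class Startup(Enum):
    NORMAL = "normal"      # no lock file found: ordinary start
    RECOVERY = "recovery"  # stale or unreadable lock file: catch up first
    ABORT = "abort"        # the lock owner is still running: exit


def pid_is_running(pid: int) -> bool:
    """POSIX-only probe: signal 0 tests for existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_lock_owner(lock_path: Path) -> Optional[int]:
    """Return the PID stored in the lock file, or None if it is unusable."""
    try:
        pid = int(lock_path.read_text().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def decide_startup(
    dag_file: str,
    is_running: Callable[[int], bool] = pid_is_running,
) -> Startup:
    """Mirror the check: foo.dag -> foo.dag.lock, then abort or continue."""
    lock_path = Path(dag_file + ".lock")
    if not lock_path.exists():
        mode = Startup.NORMAL
    else:
        owner = read_lock_owner(lock_path)
        if owner is not None and is_running(owner):
            return Startup.ABORT  # leave the owner's lock file untouched
        mode = Startup.RECOVERY
    # Assumption of this sketch: claim the DAG by writing our own PID.
    lock_path.write_text(f"{os.getpid()}\n")
    return mode
```

A bare PID is exactly what the ProcessId object is meant to improve on, since a recycled PID would make pid_is_running() report the wrong process as the lock owner. The is_running parameter exists only so the liveness check can be swapped out when experimenting.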

Things to mention:

condor_hold actually kills the DAGMan process; condor_release starts a new process (but with the same Condor ID)
lock file (including Joe's Unique PID thing)
re-reading node job userlogs
file descriptor (FD) problems on wide DAGs
a bunch of places where we do special stuff in recovery mode
recovery mode is totally separate from a rescue DAG; in fact, you can be in recovery mode while running a rescue DAG.
recovery mode really places a lot of constraints on the rest of the DAGMan code (e.g., need node names in submit events; inter-submit sleeps if using multiple logs; no macros in log file names for node jobs; probably a bunch more that I can't think of at the moment)
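The notes about re-reading node job userlogs and needing node names in submit events can be illustrated with a small replay sketch. It is not DAGMan's code: the LogEvent fields, the event labels "submit" and "terminated", and the state names are all hypothetical, and real userlog events carry much more information. The point is only that state is rebuilt from the log before anything new is submitted, and that a submit event without a node name cannot be mapped back to a node.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LogEvent:
    kind: str                   # "submit" or "terminated" (hypothetical labels)
    job_id: str                 # for example "12.0"
    node: Optional[str] = None  # node name, carried only by submit events
    exit_code: int = 0          # meaningful only for "terminated"


def replay(node_names: Iterable[str], events: Iterable[LogEvent]) -> Dict[str, str]:
    """Rebuild per-node state from logged events without submitting anything."""
    state = {name: "not_submitted" for name in node_names}
    job_to_node: Dict[str, str] = {}
    for ev in events:
        if ev.kind == "submit":
            # Without the node name there is no way to tie this job to a node.
            if ev.node not in state:
                raise ValueError(f"submit event for job {ev.job_id} names no known node")
            job_to_node[ev.job_id] = ev.node
            state[ev.node] = "running"
        elif ev.kind == "terminated":
            node = job_to_node.get(ev.job_id)
            if node is None:
                continue  # sketch assumption: ignore jobs we never saw submitted
            state[node] = "done" if ev.exit_code == 0 else "failed"
    return state
```

After the replay, only nodes still marked not_submitted (and whose parents are done) would be candidates for new submissions; nodes marked running are simply waited on.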