We figure out whether we're going to go into recovery mode in =main_init()= in _dagman_main.cpp_. After parsing the DAG file(s) and any rescue DAG, we check for the existence of a lock file (_foo.dag.lock_ if the primary DAG file is _foo.dag_). If the lock file exists, we call =util_check_lock_file()= in _dagman_util.cpp_. This attempts to instantiate a =ProcessId= object from the lock file (this is Joe Meehean's unique PID mechanism, which tries to avoid having a PID refer to the wrong process). If the process that wrote the lock file is running, we exit (we don't want two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file _isn't_ running, or we weren't able to construct the =ProcessId= object, we continue. In that case, we call =util_create_lock_file()= to write a lock file containing a serialized =ProcessId= object corresponding to _our_ process.
+Once we've figured out whether we're in recovery mode, =main_init()= calls =Dag::Bootstrap()=. That does some things that are outside the scope of recovery mode, and then performs the recovery step. (add more detail here)
+
Things to mention:
*: condor_hold actually kills the DAGMan process; condor_release starts a new process (but with the same Condor ID)
@@ -17,3 +19,4 @@
*: recovery mode is totally separate from a rescue DAG; in fact, you can be in recovery mode while running a rescue DAG.
*: recovery mode places a lot of constraints on the rest of the DAGMan code (e.g., node names are needed in submit events; inter-submit sleeps are needed if using multiple logs; no macros are allowed in log file names for node jobs; probably a bunch more that I can't think of at the moment)
*: when DAGMan exits normally (whether successfully or not) it deletes the lock file
+*: caching debug output
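The lock-file decision described above can be sketched roughly as follows. This is a simplified illustration, not the DAGMan code: =check_lock_file()=, =create_lock_file()=, =decide_startup()=, =LockStatus= and =StartupDecision= are hypothetical names, and the lock file here holds a bare PID rather than a serialized =ProcessId=. The liveness test uses POSIX =kill(pid, 0)=, so it does not guard against PID reuse, which is the problem =ProcessId= exists to address. The sketch also assumes that a stale lock file is what triggers recovery mode, and that a lock file is written when none existed; the text implies both but does not state them outright.

```cpp
#include <cerrno>
#include <fstream>
#include <string>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical result of inspecting the lock file.
enum class LockStatus { NoLockFile, OwnerRunning, OwnerGone };

// Simplified stand-in for util_check_lock_file(): reads a bare PID
// (assumption: the real file holds a serialized ProcessId instead).
LockStatus check_lock_file(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        return LockStatus::NoLockFile;
    }
    long pid = 0;
    if (!(in >> pid) || pid <= 0) {
        // Unreadable contents: treated like the "couldn't construct the
        // ProcessId object" case, i.e. we continue rather than exit.
        return LockStatus::OwnerGone;
    }
    // Signal 0 delivers nothing; it only tests whether the PID exists.
    // EPERM means the process exists but belongs to another user.
    if (kill(static_cast<pid_t>(pid), 0) == 0 || errno == EPERM) {
        return LockStatus::OwnerRunning;
    }
    return LockStatus::OwnerGone;
}

// Simplified stand-in for util_create_lock_file(): records our own PID.
bool create_lock_file(const std::string& path) {
    std::ofstream out(path, std::ios::trunc);
    out << static_cast<long>(getpid()) << "\n";
    return static_cast<bool>(out);
}

// Hypothetical summary of what main_init() decides at startup.
struct StartupDecision {
    bool should_exit;    // another instance of this DAG is still running
    bool recovery_mode;  // assumption: a stale lock file means recovery
};

StartupDecision decide_startup(const std::string& lock_path) {
    switch (check_lock_file(lock_path)) {
    case LockStatus::OwnerRunning:
        return {true, false};
    case LockStatus::OwnerGone:
        create_lock_file(lock_path);
        return {false, true};
    case LockStatus::NoLockFile:
        create_lock_file(lock_path);
        return {false, false};
    }
    return {false, false};  // not reached; keeps compilers quiet
}
```

Calling =decide_startup("foo.dag.lock")= twice from the same process reports =should_exit= on the second call, because the first call wrote a lock file naming a process that is still alive.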