Note: not yet complete!

Recovery mode is a mode in which DAGMan "catches up" with what has happened to the node jobs of its DAG before actually submitting any new jobs (it does the "catching up" by reading the node jobs' user log files). Basically, we get into recovery mode if a DAG was running, DAGMan exited in a way that didn't create a rescue DAG, and DAGMan was then restarted. There are two main ways that this can happen:

You can also force DAGMan into recovery mode by creating an appropriate lock file and then running condor_submit on the .condor.sub file (instead of running condor_submit_dag on the DAG file).

We figure out whether we're going to go into recovery mode in main_init() in dagman_main.cpp. After parsing the DAG file(s) and any rescue DAG, we check for the existence of a lock file (foo.dag.lock if the primary DAG file is foo.dag). If the lock file exists, we call util_check_lock_file() in dagman_util.cpp. This attempts to instantiate a ProcessId object from the lock file (this is Joe Meehean's unique PID class, which tries to avoid the possibility of having a PID refer to the wrong process). If the process that wrote the lock file is still running, we exit (we don't want two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file isn't running, or we weren't able to construct the ProcessId object, we continue -- it's the existence of the lock file that puts us into recovery mode. In that case, we call util_create_lock_file() to write a lock file containing a serialized ProcessId object corresponding to our process.
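
To make that control flow concrete, here is a simplified, self-contained sketch. It is not the actual DAGMan code: the real logic lives in main_init() and in util_check_lock_file()/util_create_lock_file(), and it serializes and checks a ProcessId object; the sketch just stores a bare PID and probes it with kill(pid, 0), and the helper names lock_holder_is_running() and write_lock_file() are made up for the example.

    // Simplified, self-contained sketch of the startup lock-file check described
    // above.  NOT the actual DAGMan code: the real logic is in main_init()
    // (dagman_main.cpp) and util_check_lock_file()/util_create_lock_file()
    // (dagman_util.cpp), and it serializes a ProcessId object, not the bare PID
    // probed with kill(pid, 0) below.
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Rough stand-in for util_check_lock_file(): true if the lock file names a
    // process that still appears to be alive.
    static bool lock_holder_is_running(const std::string &lockFile)
    {
        std::ifstream in(lockFile);
        pid_t pid = 0;
        if (!(in >> pid) || pid <= 0) {
            return false;           // unreadable lock file: treat it as stale
        }
        return kill(pid, 0) == 0;   // 0 => some process with that PID exists
    }

    // Rough stand-in for util_create_lock_file(): claim the DAG for this process.
    static bool write_lock_file(const std::string &lockFile)
    {
        std::ofstream out(lockFile, std::ios::trunc);
        out << getpid() << "\n";
        return static_cast<bool>(out);
    }

    int main(int argc, char **argv)
    {
        std::string primaryDag = (argc > 1) ? argv[1] : "foo.dag";
        std::string lockFile   = primaryDag + ".lock";

        bool recovery = false;
        std::ifstream probe(lockFile);
        if (probe.good()) {                         // lock file exists
            if (lock_holder_is_running(lockFile)) {
                std::fprintf(stderr, "%s is already being run by another DAGMan; exiting\n",
                             primaryDag.c_str());
                return 1;                           // never run the same DAG twice
            }
            recovery = true;                        // stale lock: the old DAGMan died
        }
        write_lock_file(lockFile);
        std::printf("recovery mode: %s\n", recovery ? "yes" : "no");
        return 0;
    }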

Once we've figured out whether we're in recovery mode, main_init() calls Dag::Bootstrap(). That does some things that are outside the scope of recovery mode, and then performs the recovery step. First, we turn on caching of dprintf() output, which improves performance significantly -- at least in the past, dprintf() opened and closed the log file for every message, which made it pretty slow in recovery mode, where we need to write out a lot of output quickly. Then we monitor the log files of all jobs that are ready (there had better be some, or the DAG has a cycle). Next we call ProcessLogEvents(), which reads the monitored log files. (more needed here)
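
Here is a rough sketch of the ordering of those recovery steps. The Dag/Node types and the method names setDprintfCaching(), monitorLogFile(), and processLogEvents() are simplified stand-ins for Dag::Bootstrap(), ProcessLogEvents(), and the real Condor classes, not the actual API; only the sequence of steps reflects the description above.

    // Rough sketch of the recovery-mode portion of Dag::Bootstrap() as described
    // above.  The Dag/Node types and method names are simplified stand-ins, not
    // Condor's actual classes; only the ordering of the steps is the point.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Node {
        std::string name;
        std::string userLog;    // user log file this node's job writes events to
        bool ready;             // no unfinished parents => ready to submit
    };

    struct Dag {
        std::vector<Node> nodes;

        void bootstrap(bool recovery) {
            if (recovery) {
                setDprintfCaching(true);           // batch up dprintf() output for speed
                for (const Node &n : nodes) {
                    if (n.ready) {
                        monitorLogFile(n.userLog); // there must be ready nodes, or the DAG has a cycle
                    }
                }
                processLogEvents();                // replay events from the monitored logs
                setDprintfCaching(false);          // (assumed) turn caching back off once caught up
            }
            // ... non-recovery bootstrap work (submitting ready nodes, etc.) ...
        }

        // Stubs that just report what the real code would do.
        void setDprintfCaching(bool on) {
            std::cout << (on ? "enable" : "disable") << " dprintf() caching\n";
        }
        void monitorLogFile(const std::string &log) {
            std::cout << "monitor user log " << log << "\n";
        }
        void processLogEvents() {
            std::cout << "read events from the monitored logs to catch up\n";
        }
    };

    int main() {
        Dag dag;
        dag.nodes = { {"A", "A.log", true}, {"B", "B.log", false} };
        dag.bootstrap(/*recovery=*/true);
        return 0;
    }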

Things to mention: