Note: not yet complete!

Recovery mode is a mode in which DAGMan "catches up" with what has happened to the node jobs of its DAG before actually submitting any new jobs (it does the "catching up" by reading the node jobs' user log files). Basically, we get into recovery mode if a DAG was running, DAGMan exited in a way that didn't create a rescue DAG, and DAGMan was then restarted. There are two main ways that this can happen:

You can also force DAGMan into recovery mode by creating an appropriate lock file and then running condor_submit on the .condor.sub file (instead of running condor_submit_dag on the DAG file).

We figure out whether we're going to go into recovery mode in main_init() in dagman_main.cpp. After parsing the DAG file(s) and any rescue DAG, we check for the existence of a lock file (foo.dag.lock if the primary DAG file is foo.dag). If the lock file exists, we call util_check_lock_file() in dagman_util.cpp. This attempts to instantiate a ProcessId object from the lock file (this is Joe Meehean's unique PID class, which tries to avoid the possibility of having a PID refer to the wrong process). If the process that wrote the lock file is still running, we exit (we don't want two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file isn't running, or we weren't able to construct the ProcessId object, we continue -- it's the existence of the lock file that puts us into recovery mode. In that case, we call util_create_lock_file() to write a lock file containing a serialized ProcessId object corresponding to our process.
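
To make that control flow concrete, here is a simplified, self-contained sketch. It is not the actual DAGMan code: the real logic lives in main_init() and in util_check_lock_file()/util_create_lock_file(), and it serializes and checks a ProcessId object; the sketch just stores a bare PID and probes it with kill(pid, 0), and the helper names lock_holder_is_running() and write_lock_file() are made up for the example.

    // Simplified, self-contained sketch of the startup lock-file check described
    // above.  NOT the actual DAGMan code: the real logic is in main_init()
    // (dagman_main.cpp) and util_check_lock_file()/util_create_lock_file()
    // (dagman_util.cpp), and it serializes a ProcessId object, not the bare PID
    // probed with kill(pid, 0) below.
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Rough stand-in for util_check_lock_file(): true if the lock file names a
    // process that still appears to be alive.
    static bool lock_holder_is_running(const std::string &lockFile)
    {
        std::ifstream in(lockFile);
        pid_t pid = 0;
        if (!(in >> pid) || pid <= 0) {
            return false;           // unreadable lock file: treat it as stale
        }
        return kill(pid, 0) == 0;   // 0 => some process with that PID exists
    }

    // Rough stand-in for util_create_lock_file(): claim the DAG for this process.
    static bool write_lock_file(const std::string &lockFile)
    {
        std::ofstream out(lockFile, std::ios::trunc);
        out << getpid() << "\n";
        return static_cast<bool>(out);
    }

    int main(int argc, char **argv)
    {
        std::string primaryDag = (argc > 1) ? argv[1] : "foo.dag";
        std::string lockFile   = primaryDag + ".lock";

        bool recovery = false;
        std::ifstream probe(lockFile);
        if (probe.good()) {                         // lock file exists
            if (lock_holder_is_running(lockFile)) {
                std::fprintf(stderr, "%s is already being run by another DAGMan; exiting\n",
                             primaryDag.c_str());
                return 1;                           // never run the same DAG twice
            }
            recovery = true;                        // stale lock: the old DAGMan died
        }
        write_lock_file(lockFile);
        std::printf("recovery mode: %s\n", recovery ? "yes" : "no");
        return 0;
    }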

Once we've figured out whether we're in recovery mode, main_init() calls Dag::Bootstrap(). That does some things that are outside the scope of recovery mode, and then performs the recovery step. First, we turn on caching of dprintf() output, which improves performance significantly -- at least in the past, dprintf() opened and closed the log file for every message, which made it pretty slow in recovery mode, where we need to write out a lot of output quickly. Then we monitor the log files of all jobs that are ready (there had better be some, or the DAG has a cycle). Next we call ProcessLogEvents(), which reads the monitored log files. (more needed here)
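
Here is a rough sketch of the ordering of those recovery steps. The Dag/Node types and the method names setDprintfCaching(), monitorLogFile(), and processLogEvents() are simplified stand-ins for Dag::Bootstrap(), ProcessLogEvents(), and the real Condor classes, not the actual API; only the sequence of steps reflects the description above.

    // Rough sketch of the recovery-mode portion of Dag::Bootstrap() as described
    // above.  The Dag/Node types and method names are simplified stand-ins, not
    // Condor's actual classes; only the ordering of the steps is the point.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Node {
        std::string name;
        std::string userLog;    // user log file this node's job writes events to
        bool ready;             // no unfinished parents => ready to submit
    };

    struct Dag {
        std::vector<Node> nodes;

        void bootstrap(bool recovery) {
            if (recovery) {
                setDprintfCaching(true);           // batch up dprintf() output for speed
                for (const Node &n : nodes) {
                    if (n.ready) {
                        monitorLogFile(n.userLog); // there must be ready nodes, or the DAG has a cycle
                    }
                }
                processLogEvents();                // replay events from the monitored logs
                setDprintfCaching(false);          // (assumed) turn caching back off once caught up
            }
            // ... non-recovery bootstrap work (submitting ready nodes, etc.) ...
        }

        // Stubs that just report what the real code would do.
        void setDprintfCaching(bool on) {
            std::cout << (on ? "enable" : "disable") << " dprintf() caching\n";
        }
        void monitorLogFile(const std::string &log) {
            std::cout << "monitor user log " << log << "\n";
        }
        void processLogEvents() {
            std::cout << "read events from the monitored logs to catch up\n";
        }
    };

    int main() {
        Dag dag;
        dag.nodes = { {"A", "A.log", true}, {"B", "B.log", false} };
        dag.bootstrap(/*recovery=*/true);
        return 0;
    }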

Things to mention: