-Recovery mode is a mode in which DAGMan "catches up" with what has happened with node jobs of its DAG before actually starting up any jobs. Basically, we get into recovery mode if a DAG was running and DAGMan exited in a way that didn't start a rescue DAG, and then DAGMan was restarted. There are two main ways that this can happen:
+Recovery mode is a mode in which DAGMan "catches up" with what has happened with the node jobs of its DAG before actually submitting any new jobs. Basically, we get into recovery mode if a DAG was running and DAGMan exited in a way that didn't create a rescue DAG, and then DAGMan was restarted. There are two main ways that this can happen:
*: condor_hold/condor_release of the condor_dagman job.
*: DAGMan exits without creating a rescue DAG because of something like an assertion.
You can also force DAGMan into recovery mode by creating an appropriate lock file and condor_submitting the .condor.sub file (instead of condor_submit_dagging the DAG file).
-We figure out whether we're going to go into recovery mode in =main_init()= in _dagman_main.cpp_. After parsing the DAG file(s) and any rescue DAG, we check for the existance of a lock file (_foo.dag.lock_ if the primary DAG file is _foo.dag_). If the lock file exists, we call util_check_lock_file() in _dagman_util.cpp_. This attempts to instantiate a =ProcessId= object from the lock file (this is Joe Meehean's unique PID thing, which tries to avoid the possibility of having a PID refer to the wrong process). If the process that wrote the lock file is running, we will exit (we don't want to have two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file _isn't_ running, or we weren't able to construct the ProcessID object, we continue. In that case, we call util_create_lock_file() to write a lock file containing a serialized ProcessID object corresponding to _our_ process.
+We figure out whether we're going to go into recovery mode in =main_init()= in _dagman_main.cpp_. After parsing the DAG file(s) and any rescue DAG, we check for the existence of a lock file (_foo.dag.lock_ if the primary DAG file is _foo.dag_). If the lock file exists, we call =util_check_lock_file()= in _dagman_util.cpp_. This attempts to instantiate a =ProcessId= object from the lock file (this is Joe Meehean's unique PID thing, which tries to avoid the possibility of having a PID refer to the wrong process). If the process that wrote the lock file is running, we will exit (we don't want to have two instances of the same DAG running at the same time; see the log file section). If the process that wrote the lock file _isn't_ running, or we weren't able to construct the =ProcessId= object, we continue. In that case, we call =util_create_lock_file()= to write a lock file containing a serialized =ProcessId= object corresponding to _our_ process.
Things to mention:
@@ -16,3 +16,4 @@
*: a bunch of places where we do special stuff in recovery mode
*: recovery mode is totally separate from a rescue DAG; in fact, you can be in recovery mode while running a rescue DAG.
*: recovery mode really places a lot of constraints on the rest of the DAGMan code (e.g., need node names in submit events; inter-submit sleeps if using multiple logs; no macros in log file names for node jobs; probably a bunch more that I can't think of at the moment)
+*: when DAGMan exits normally (whether successfully or not) it deletes the lock file
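
The lock-file check described above can be sketched roughly as follows. This is only an illustration of the decision, not the code in _dagman_main.cpp_ or _dagman_util.cpp_: the names =LockOwner=, =StartupAction=, =lockFileName()= and =decideStartup()= are made up for this sketch, and it assumes that a leftover lock file whose writer is no longer running (or can't be identified) is what puts us into recovery mode, which the text implies but doesn't spell out. It also assumes that having no lock file at all means an ordinary start.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of what inspecting the lock file told us.
// None of these names come from the DAGMan source.
enum class LockOwner {
    NoLockFile,   // foo.dag.lock does not exist
    Running,      // the process that wrote the lock file is still alive
    NotRunning,   // the process that wrote the lock file is gone
    Unreadable    // we could not reconstruct a process identity from the file
};

// Hypothetical outcome of the startup check.
enum class StartupAction { NormalStart, Exit, RecoveryMode };

// The lock file name is the primary DAG file name plus ".lock"
// (foo.dag -> foo.dag.lock).
std::string lockFileName(const std::string& primaryDagFile) {
    assert(!primaryDagFile.empty());
    return primaryDagFile + ".lock";
}

// Decide what to do at startup, following the prose description.
StartupAction decideStartup(LockOwner owner) {
    switch (owner) {
    case LockOwner::NoLockFile:
        // Assumption: no leftover lock file means nothing to catch up on.
        return StartupAction::NormalStart;
    case LockOwner::Running:
        // Never run two instances of the same DAG at the same time.
        return StartupAction::Exit;
    case LockOwner::NotRunning:
    case LockOwner::Unreadable:
        // Assumption: continue in recovery mode; the caller would then
        // write a new lock file describing our own process.
        return StartupAction::RecoveryMode;
    }
    return StartupAction::NormalStart; // not reached; keeps compilers quiet
}
```

Keeping the decision separate from the file handling makes the three outcomes easy to see. It also shows why the normal-exit cleanup matters: if the lock file is always deleted on a normal exit, finding one at startup means the previous DAGMan either is still running or went away without cleaning up.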