HTCondorWiki: Rescue Dag

Documentation of the rescue DAG/condor_rm and related code in DAGMan.

Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

Naming of rescue DAG files

If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).
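
The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not DAGMan's C++ code: the helper names rescue_dag_name() and highest_rescue_number() are hypothetical, and the assumption that the suffix is always exactly three digits is taken from the foo.dag.rescue.001 example.

```python
import re


def rescue_dag_name(dag_file, number):
    """Build a rescue DAG file name such as foo.dag.rescue.001."""
    if number < 1:
        raise ValueError("rescue DAG numbers start at 1")
    # Assumption: the number is zero-padded to three digits, as in the example.
    return "%s.rescue.%03d" % (dag_file, number)


def highest_rescue_number(dag_file, existing_files):
    """Return the highest rescue DAG number found, or 0 if there is none."""
    pattern = re.compile(re.escape(dag_file) + r"\.rescue\.(\d{3})")
    highest = 0
    for name in existing_files:
        match = pattern.fullmatch(name)  # ignores names such as *.005.old
        if match:
            highest = max(highest, int(match.group(1)))
    return highest
```

With these helpers, the next rescue DAG to write would be rescue_dag_name(dag, highest_rescue_number(dag, files) + 1), and the one to run by default would use the highest number itself.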

Format of rescue DAG

As of version 7.7.2, a rescue DAG file is no longer a complete DAG file, as it used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and remaining node retries) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file when they have a rescue DAG, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)

Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.

Creation of rescue DAG

Rescue DAGs are created in three cases:

The DAGMan job itself is condor_rm'ed.
A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
Immediately after parsing (or attempting to parse) the DAG, if -DumpRescue is given on the command line.

In the first two cases, main_shutdown_rescue() in dagman_main.cpp gets called, and that calls Dag::Rescue(). (If DAGMan is condor_rm'ed, main_shutdown_rescue() is called by main_shutdown_remove(), which is called by daemoncore; otherwise, it's explicitly called at various places in the DAGMan code.) Note that main_shutdown_rescue() uses the inShutdownRescue flag to make sure it isn't re-entered recursively if an error occurs while trying to write the rescue DAG.
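
The purpose of a flag like inShutdownRescue can be shown with a small re-entrancy guard. This is a sketch of the general technique only: the ShutdownGuard class, its attempts counter, and the use of OSError for a failed write are all made up for the example and do not reproduce the real control flow in dagman_main.cpp.

```python
class ShutdownGuard:
    """Hypothetical stand-in for the shutdown path guarded by a flag."""

    def __init__(self, write_rescue):
        self.in_shutdown_rescue = False  # plays the role of inShutdownRescue
        self.write_rescue = write_rescue
        self.attempts = 0  # how many times the rescue write was attempted

    def shutdown_rescue(self):
        if self.in_shutdown_rescue:
            # We are already shutting down: do not try to write again.
            return
        self.in_shutdown_rescue = True
        self.attempts += 1
        try:
            self.write_rescue()
        except OSError:
            # Assumption: an error during the write leads back into the
            # shutdown path. The flag turns this second call into a no-op.
            self.shutdown_rescue()
```

Without the flag check, a writer that always fails would cause shutdown_rescue() to call itself until the stack overflowed.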

In the third case, Dag::Rescue() is called at various places in main_init() in dagman_main.cpp (man, that function is too big!!).

In any case, Dag::Rescue() calls FindLastRescueDagNum(), which checks all legal rescue DAG names and finds the highest-numbered one, and then RescueDagName(), which creates a properly-formatted rescue DAG file name, and then WriteRescue(), which actually writes the rescue DAG.

I think WriteRescue() is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if isPartial is true. WriteRescue() iterates through all of the nodes and calls WriteNodeToRescue() on each one. The main reason for breaking WriteNodeToRescue() out of WriteRescue() was to reduce the excessive indentation, and make the code easier to read. Note that this isPartial flag is passed to WriteNodeToRescue(), because most of the info printed by WriteNodeToRescue() is not printed to a partial rescue DAG (which is the default mode).
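
As a rough picture of the WriteRescue()/WriteNodeToRescue() split in partial mode, here is a minimal Python sketch. The Node fields, the function names, the header comment, and the exact line syntax (DONE <node> and RETRY <node> <remaining>) are assumptions made for illustration, as is the rule for when a RETRY line is emitted; the real C++ code should be consulted for the actual output.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, minimal view of a DAG node's rescue-relevant state."""
    name: str
    done: bool = False
    retry_max: int = 0      # retries allowed by the original DAG file
    retries_used: int = 0   # retries already consumed in this run


def node_to_rescue_lines(node):
    """Return the partial rescue DAG lines for one node (possibly none)."""
    if node.done:
        return ["DONE %s" % node.name]
    if node.retry_max > 0:
        # Assumption: unfinished nodes with retries get their remaining count.
        return ["RETRY %s %d" % (node.name, node.retry_max - node.retries_used)]
    return []


def write_partial_rescue(nodes):
    """Build the text of a partial rescue DAG by visiting every node."""
    lines = ["# Partial rescue DAG; parse together with the original DAG file"]
    for node in nodes:
        lines.extend(node_to_rescue_lines(node))
    return "\n".join(lines) + "\n"
```

Keeping the per-node logic in its own function mirrors the reason given above for breaking out WriteNodeToRescue(): the loop stays flat and readable.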

Use/parsing of rescue DAG

main_init() (that huge function again!) calls FindLastRescueDagNum() to find out if there are any rescue DAGs for the current DAG. (Oh, yeah -- the user now has to run a rescue DAG by re-submitting the original DAG, not by submitting the rescue DAG directly as they did a long time ago.) The user can also specify -DoRescueFrom <number> on the command line to specify a rescue DAG to run, if they don't want to run the latest one. If you specify a rescue DAG number, any later rescue DAGs are renamed by RenameRescueDagsAfter() (foo.dag.rescue.005 -> foo.dag.rescue.005.old, etc.).
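
The renaming step can be illustrated as follows. This is a sketch of the behavior described above, not a port of RenameRescueDagsAfter(): the function name rename_rescue_dags_after(), its arguments, and the three-digit suffix pattern are assumptions.

```python
import os
import re


def rename_rescue_dags_after(directory, dag_file, number):
    """Rename rescue DAGs numbered above `number` by appending ".old".

    Returns the original names of the files that were renamed.
    """
    pattern = re.compile(re.escape(dag_file) + r"\.rescue\.(\d{3})")
    renamed = []
    for name in sorted(os.listdir(directory)):
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) > number:
            path = os.path.join(directory, name)
            os.rename(path, path + ".old")
            renamed.append(name)
    return renamed
```

For example, with rescue DAGs 001 through 005 present and -DoRescueFrom 3 requested, files 004 and 005 would end up as foo.dag.rescue.004.old and foo.dag.rescue.005.old, leaving 003 as the highest-numbered rescue DAG.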

After parsing the original DAG file(s), DAGMan then parses the rescue DAG (which just sets the status of various nodes to DONE as appropriate, and also possibly changes the number of retries left on some nodes). At that point, we're ready to actually run the DAG.
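
To make the second parsing pass concrete, here is a minimal sketch of applying a partial rescue DAG to nodes that were already created from the original DAG file. The dictionary layout, the apply_rescue() name, and the DONE <node> / RETRY <node> <remaining> line syntax are assumptions for the example, not a description of DAGMan's parser.

```python
def apply_rescue(nodes, rescue_text):
    """Apply rescue DAG lines to `nodes`, a dict of name -> state dict.

    Each state dict is assumed to hold "done" (bool) and "retries_left" (int).
    A node missing from `nodes` raises KeyError, reflecting that the rescue
    DAG only makes sense alongside the original DAG file.
    """
    for raw in rescue_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        keyword = fields[0].upper()
        if keyword == "DONE" and len(fields) == 2:
            nodes[fields[1]]["done"] = True
        elif keyword == "RETRY" and len(fields) == 3:
            nodes[fields[1]]["retries_left"] = int(fields[2])
        else:
            raise ValueError("unexpected rescue DAG line: %r" % raw)
    return nodes
```

After this pass, a scheduler would simply skip nodes marked done and start from the remaining ones.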

Additional actions on condor_rm of DAGMan

In main_shutdown_rescue(), if there are any node jobs in the queue, we call Dag::RemoveRunningJobs() to remove them. If any PRE or POST scripts are running, we call Dag::RemoveRunningScripts() to kill them. (Note that in the case of DAGMan having made all the progress it can in the face of node failures, there won't be any node jobs or scripts running at this point.)

Dag::RemoveRunningJobs() removes any HTCondor jobs by using the constraint "DAGManJobId == <id>"; any Stork jobs are removed individually.

Dag::RemoveRunningScripts() iterates through all of the nodes, and individually kills any running scripts via daemoncore.

In main_shutdown_rescue() we also call Dag::StartFinalNode(), which starts the DAG final node if there is one and we haven't already started it.