HTCondorWiki: Rescue Dag

Documentation of the rescue DAG/condor_rm and related code in DAGMan.

Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

Naming of rescue DAG files

If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).
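The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the scheme, not the DAGMan C++ code: the function names are hypothetical, and the list of existing files is passed in by hand rather than read from disk.

```python
import re


def rescue_dag_name(dag_file, number):
    """Build a rescue DAG name such as foo.dag.rescue.001 (hypothetical helper)."""
    return f"{dag_file}.rescue.{number:03d}"


def find_last_rescue_num(dag_file, existing_files):
    """Return the highest rescue number found for dag_file, or 0 if there is none."""
    pattern = re.compile(re.escape(dag_file) + r"\.rescue\.(\d{3})")
    last = 0
    for name in existing_files:
        match = pattern.fullmatch(name)
        if match:
            last = max(last, int(match.group(1)))
    return last


files = ["foo.dag", "foo.dag.rescue.001", "foo.dag.rescue.002", "bar.dag.rescue.009"]
last = find_last_rescue_num("foo.dag", files)
to_run = rescue_dag_name("foo.dag", last)        # highest-numbered rescue DAG, used when running
to_write = rescue_dag_name("foo.dag", last + 1)  # name for the next rescue DAG to be written
```

Rescue files belonging to other DAGs (bar.dag here) are ignored because the pattern is anchored on the original DAG file name.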

Format of rescue DAG

As of version 7.7.2, a rescue DAG file is no longer a complete DAG file, as it used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and the remaining retries for each node) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file after a rescue DAG has been written, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)

Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.
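To make "parsed in conjunction with the original DAG file" concrete, here is a rough Python sketch of overlaying rescue state on nodes that have already been read from the original DAG. It assumes the rescue lines look like `DONE <node>` and `RETRY <node> <count>`; that layout, the node dictionary, and the function name are assumptions for illustration, and the lines DAGMan actually writes may differ.

```python
def apply_rescue(nodes, rescue_text):
    """Overlay rescue state on nodes parsed from the original DAG.

    nodes maps node name -> {"done": bool, "retries": int} (hypothetical structure).
    """
    for raw in rescue_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        keyword = fields[0].upper()
        if keyword == "DONE" and len(fields) == 2:
            nodes[fields[1]]["done"] = True
        elif keyword == "RETRY" and len(fields) == 3:
            nodes[fields[1]]["retries"] = int(fields[2])
        else:
            raise ValueError(f"unexpected line in rescue DAG: {line!r}")
    return nodes


# State as it would come from parsing the original DAG file.
nodes = {
    "A": {"done": False, "retries": 3},
    "B": {"done": False, "retries": 3},
}
rescue = "# rescue state only\nDONE A\nRETRY B 2\n"
apply_rescue(nodes, rescue)
```

Because the rescue file carries no node definitions, a node name that is missing from the original DAG raises a KeyError in this sketch.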

Creation of rescue DAG

Rescue DAGs are created in three cases:

The DAGMan job itself is condor_rm'ed.
A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
Immediately after parsing (or attempting to parse) the DAG, if -DumpRescue is given on the command line.

In the first two cases, main_shutdown_rescue() in dagman_main.cpp gets called, and that calls Dag::Rescue(). If DAGMan is condor_rm'ed, main_shutdown_rescue() is called by main_shutdown_remove(), which is called by daemoncore; otherwise, it's explicitly called at various places in the DAGMan code. Note that main_shutdown_rescue() uses the inShutdownRescue flag to make sure you don't get into a recursion if you have an error while trying to write the rescue DAG.
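The recursion guard is easier to see in a stripped-down form. The Python sketch below only mirrors the idea of the inShutdownRescue flag; the class, the simulated failure, and the attempt counter are made up for illustration and do not reflect the real DAGMan code.

```python
class RescueShutdown:
    """Toy model of a shutdown path protected by a re-entrancy flag."""

    def __init__(self):
        self.in_shutdown_rescue = False
        self.write_attempts = 0

    def write_rescue(self):
        self.write_attempts += 1
        # Simulate an error while writing: the error path asks for a
        # rescue shutdown again, which is where recursion could start.
        self.main_shutdown_rescue()

    def main_shutdown_rescue(self):
        if self.in_shutdown_rescue:
            return  # already shutting down; do not try to write again
        self.in_shutdown_rescue = True
        self.write_rescue()


shutdown = RescueShutdown()
shutdown.main_shutdown_rescue()
```

Without the flag check, the failing write would call back into the shutdown routine indefinitely; with it, the write is attempted exactly once.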

In the third case, Dag::Rescue() is called at various places in main_init() in dagman_main.cpp (man, that function is too big!!).

In any case, Dag::Rescue() calls FindLastRescueDagNum(), which checks all legal rescue DAG names and finds the highest-numbered one, and then RescueDagName(), which creates a properly-formatted rescue DAG file name, and then WriteRescue(), which actually writes the rescue DAG.

I think WriteRescue() is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if isPartial is true. WriteRescue() iterates through all of the nodes and calls WriteNodeToRescue() on each one. The main reason for breaking WriteNodeToRescue() out of WriteRescue() was to reduce the excessive indentation, and make the code easier to read. Note that this isPartial flag is passed to WriteNodeToRescue(), because most of the info printed by WriteNodeToRescue() is not printed to a partial rescue DAG.
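The shape of that loop, with the isPartial flag passed down to the per-node writer, might look roughly like the following Python sketch. The node fields, the exact line formats, and the header comment are assumptions chosen for illustration; the real functions write more per-node information in the non-partial case.

```python
def write_node_to_rescue(node, is_partial):
    """Return the rescue DAG lines for one node (hypothetical formats)."""
    lines = []
    if is_partial:
        # Partial rescue DAG: only record that the node is done.
        if node["done"]:
            lines.append(f"DONE {node['name']}")
    else:
        # Full rescue DAG: the node definition itself is written out too.
        done_suffix = " DONE" if node["done"] else ""
        lines.append(f"JOB {node['name']} {node['submit_file']}{done_suffix}")
    # Remaining retries are recorded in both modes for unfinished nodes.
    if not node["done"] and node["retries_left"] > 0:
        lines.append(f"RETRY {node['name']} {node['retries_left']}")
    return lines


def write_rescue(nodes, is_partial=True):
    """Build the rescue DAG text by visiting every node in turn."""
    out = ["# Rescue DAG (illustrative sketch)"]
    for node in nodes:
        out.extend(write_node_to_rescue(node, is_partial))
    return "\n".join(out) + "\n"


nodes = [
    {"name": "A", "submit_file": "a.sub", "done": True, "retries_left": 0},
    {"name": "B", "submit_file": "b.sub", "done": False, "retries_left": 2},
]
partial_text = write_rescue(nodes)
full_text = write_rescue(nodes, is_partial=False)
```

Keeping the per-node work in its own function means the partial/full branching lives in one place instead of being nested inside the loop.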

Use/parsing of rescue DAG

Additional actions on condor_rm of DAGMan

remove jobs final node