Documentation of the rescue DAG/condor_rm and related code in DAGMan.

Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

{subsection: Naming of rescue DAG files}

If the original DAG file is _foo.dag_, the rescue DAG files will be named _foo.dag.rescue.001_, _foo.dag.rescue.002_, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).

{subsection: Format of rescue DAG}

As of version 7.7.2, a rescue DAG file is no longer a complete DAG file, as it used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and remaining node retries) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file when they have a rescue DAG, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)

In the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.

{subsection: Creation of rescue DAG}

Rescue DAGs are created in three cases:

*: The DAGMan job itself is condor_rm'ed.
*: A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
*: Immediately after parsing (or attempting to parse) the DAG, if =-DumpRescue= is given on the command line.

In the first two cases, =main_shutdown_rescue()= in _dagman_main.cpp_ gets called, and that calls =Dag::Rescue()=. (If DAGMan is condor_rm'ed, =main_shutdown_rescue()= is called by =main_shutdown_remove()=, which is called by daemoncore; otherwise, it's explicitly called at various places in the DAGMan code.)
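
As a rough illustration of the naming scheme described under "Naming of rescue DAG files", here is a minimal sketch of how a rescue DAG name can be built and how the highest-numbered existing rescue DAG can be found. This is not DAGMan's actual code: the helper names =makeRescueName()= and =findLastRescueNum()=, the =exists= predicate, and the =maxNum= limit are all assumptions made for the example.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical helper (not DAGMan's real function): build the name of
// rescue DAG number `num` for the given original DAG file, e.g.
// "foo.dag" + 1 gives "foo.dag.rescue.001".
std::string makeRescueName(const std::string &dagFile, int num) {
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), ".rescue.%03d", num);
    return dagFile + suffix;
}

// Hypothetical helper: return the highest rescue DAG number in
// [1, maxNum] for which exists(name) is true, or 0 if there is none.
// `exists` is injected so the sketch does not depend on the file system;
// `maxNum` is an assumed upper bound on the numbers to check.
int findLastRescueNum(const std::string &dagFile, int maxNum,
                      const std::function<bool(const std::string &)> &exists) {
    int last = 0;
    for (int n = 1; n <= maxNum; ++n) {
        if (exists(makeRescueName(dagFile, n))) {
            last = n;  // keep the highest number seen so far
        }
    }
    return last;
}
```

Under these assumptions, the next rescue DAG to write would be number =findLastRescueNum(...) + 1=, and the one to run by default would be number =findLastRescueNum(...)=. Checking every number rather than stopping at the first gap means a missing intermediate file does not hide a higher-numbered one.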

Note that =main_shutdown_rescue()= uses the =inShutdownRescue= flag to avoid recursing if an error occurs while it is trying to write the rescue DAG.

In the third case, =Dag::Rescue()= is called at various places in =main_init()= in _dagman_main.cpp_ (man, that function is too big!!).

In any case, =Dag::Rescue()= calls three functions in turn:

*: =FindLastRescueDagNum()=, which checks all legal rescue DAG names and finds the highest-numbered one.
*: =RescueDagName()=, which creates a properly-formatted rescue DAG file name.
*: =WriteRescue()=, which actually writes the rescue DAG.

I think =WriteRescue()= is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true.

=WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one. The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation and make the code easier to read. Note that the =isPartial= flag is passed to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG.

{subsection: Use/parsing of rescue DAG}

{subsection: Additional actions on condor_rm of DAGMan}

remove jobs

final node