Documentation of the rescue DAG/condor_rm and related code in DAGMan.

Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

{subsection: Naming of rescue DAG files}

If the original DAG file is _foo.dag_, the rescue DAG files will be named _foo.dag.rescue.001_, _foo.dag.rescue.002_, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).

{subsection: Format of rescue DAG}

As of version 7.7.2, rescue DAG files are no longer complete DAG files, as they used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and the remaining retries for each node) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file when they have a rescue DAG, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)

Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.

{subsection: Creation of rescue DAG}

Rescue DAGs are created in three cases:
*: The DAGMan job itself is condor_rm'ed.
*: A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
*: Immediately after parsing (or attempting to parse) the DAG, if =-DumpRescue= is given on the command line.

In the first two cases, =main_shutdown_rescue()= in _dagman_main.cpp_ gets called, and that calls =Dag::Rescue()=. (If DAGMan is condor_rm'ed, =main_shutdown_rescue()= is called by =main_shutdown_remove()=, which is called by DaemonCore; otherwise, it's explicitly called at various places in the DAGMan code.)
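To make the naming scheme from "Naming of rescue DAG files" concrete, here is a minimal sketch of how such a name could be built. The helpers =rescueName()= and =nextRescueName()= are hypothetical and are not the actual DAGMan functions; the three-digit zero padding is inferred from the _foo.dag.rescue.001_ example, and what happens to numbers above 999 is not covered here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not DAGMan's own code): builds
// "<dagfile>.rescue.<NNN>", zero-padding the number to three digits
// to match the foo.dag.rescue.001 pattern.
std::string rescueName(const std::string &dagFile, int rescueNum) {
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), ".rescue.%03d", rescueNum);
    return dagFile + suffix;
}

// Each newly written rescue DAG uses the previous highest number plus one.
// lastNum is assumed to be 0 when no rescue DAG exists yet.
std::string nextRescueName(const std::string &dagFile, int lastNum) {
    return rescueName(dagFile, lastNum + 1);
}
```

For example, =nextRescueName("foo.dag", 1)= yields _foo.dag.rescue.002_.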
Note that =main_shutdown_rescue()= uses the =inShutdownRescue= flag to make sure you don't recurse endlessly if an error occurs while trying to write the rescue DAG.

In the third case, =Dag::Rescue()= is called at various places in =main_init()= in _dagman_main.cpp_ (man, that function is too big!!).

In any case, =Dag::Rescue()= first calls =FindLastRescueDagNum()=, which checks all legal rescue DAG names and finds the highest-numbered one; then =RescueDagName()=, which creates a properly-formatted rescue DAG file name; and finally =WriteRescue()=, which actually writes the rescue DAG.

I think =WriteRescue()= is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true. =WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one. The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation and make the code easier to read. Note that the =isPartial= flag is passed to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG (which is the default mode).

{subsection: Use/parsing of rescue DAG}

=main_init()= (that huge function again!) calls =FindLastRescueDagNum()= to find out if there are any rescue DAGs for the current DAG. (Oh, yeah: the user now has to run a rescue DAG by re-submitting the original DAG, not by submitting the rescue DAG directly as they did a long time ago.)

The user can also specify =-DoRescueFrom = on the command line to choose which rescue DAG to run, if they don't want to run the latest one. If you specify a rescue DAG number, any later rescue DAGs are renamed by =RenameRescueDagsAfter()= (_foo.dag.rescue.005_ -> _foo.dag.rescue.005.old_, etc.).
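The lookup and rename steps can be sketched as follows. This is only an illustration under stated assumptions, not the DAGMan implementation: the file system is replaced by an in-memory set of names, =maxNum= is a hypothetical upper bound on rescue DAG numbers, and =findLastRescueNum()= and =planRenamesAfter()= are made-up names standing in for the behavior described for =FindLastRescueDagNum()= and =RenameRescueDagsAfter()=.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the file system: the set of existing file names.
using FileSet = std::set<std::string>;

// Same illustrative naming helper as the pattern foo.dag.rescue.NNN.
static std::string rescueName(const std::string &dagFile, int n) {
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), ".rescue.%03d", n);
    return dagFile + suffix;
}

// Probe every legal rescue DAG name from 1 to maxNum and remember the
// highest number whose file exists. Returns 0 if there is none.
int findLastRescueNum(const FileSet &files, const std::string &dagFile,
                      int maxNum) {
    int last = 0;
    for (int n = 1; n <= maxNum; ++n) {
        if (files.count(rescueName(dagFile, n)) > 0) last = n;
    }
    return last;
}

// Plan the renames needed when the user picks rescue DAG number `chosen`:
// every existing rescue DAG with a higher number gets ".old" appended.
std::vector<std::pair<std::string, std::string>>
planRenamesAfter(const FileSet &files, const std::string &dagFile,
                 int chosen, int maxNum) {
    std::vector<std::pair<std::string, std::string>> renames;
    const int last = findLastRescueNum(files, dagFile, maxNum);
    for (int n = chosen + 1; n <= last; ++n) {
        const std::string name = rescueName(dagFile, n);
        if (files.count(name) > 0) renames.emplace_back(name, name + ".old");
    }
    return renames;
}
```

Returning a list of planned renames, rather than renaming files directly, keeps the sketch free of file system calls; real code would perform each rename and handle failures.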
After parsing the original DAG file(s), DAGMan then parses the rescue DAG (which just sets the status of various nodes to DONE as appropriate, and also possibly changes the number of retries left on some nodes). At that point, we're ready to actually run the DAG.

{subsection: Additional actions on condor_rm of DAGMan}

In =main_shutdown_rescue()=, if there are any node jobs in the queue, we call =Dag::RemoveRunningJobs()= to remove them. If any PRE or POST scripts are running, we call =Dag::RemoveRunningScripts()= to kill them. (Note that if DAGMan has made all the progress it can in the face of node failures, there won't be any node jobs or scripts running at this point.)

=Dag::RemoveRunningJobs()= removes any HTCondor jobs by using the constraint _"DAGManJobId == "_; any Stork jobs are removed individually.

=Dag::RemoveRunningScripts()= iterates through all of the nodes, and individually kills any running scripts via DaemonCore.

In =main_shutdown_rescue()= we also call =Dag::StartFinalNode()=, which starts the DAG final node if there is one and we haven't already started it.
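As an illustration of the rescue DAG parsing step described under "Use/parsing of rescue DAG", here is a minimal sketch of applying a partial rescue DAG to nodes that were already created from the original DAG file. Everything in it is an assumption made for the example: the =NodeState= struct, the =applyRescue()= function, and the line shapes "DONE node", "RETRY node count", and "#" for comments are not taken from the DAGMan source.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical, minimal node state; real DAGMan nodes carry much more.
struct NodeState {
    bool done = false;
    int retriesLeft = 0;
};

// Apply the text of a partial rescue DAG to nodes built from the original
// DAG file. Assumed line shapes: "DONE <node>" and "RETRY <node> <count>";
// lines starting with '#' are comments. Returns false on a malformed line
// or a node that the original DAG did not define.
bool applyRescue(const std::string &rescueText,
                 std::map<std::string, NodeState> &nodes) {
    std::istringstream in(rescueText);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string keyword, name;
        // Skip blank lines and comments.
        if (!(fields >> keyword) || keyword[0] == '#') continue;
        if (!(fields >> name)) return false;
        auto it = nodes.find(name);
        if (it == nodes.end()) return false;
        if (keyword == "DONE") {
            it->second.done = true;
        } else if (keyword == "RETRY") {
            int count = 0;
            if (!(fields >> count) || count < 0) return false;
            it->second.retriesLeft = count;
        } else {
            return false;
        }
    }
    return true;
}
```

The sketch only updates existing nodes and never creates new ones, which mirrors the point that a partial rescue DAG is meaningless without the original DAG file.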