HTCondorWiki: Rescue Dag

 
 {subsection: Naming of rescue DAG files}
 
-If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc.  Each time a rescue DAG is written, the rescue DAG number is incremented.  When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).  If there are any gaps in the numbering sequence (e.g., the user deletes an intermediate file) the system will be goofed up.
+If the original DAG file is _foo.dag_, the rescue DAG files will be named _foo.dag.rescue.001_, _foo.dag.rescue.002_, etc.  Each time a rescue DAG is written, the rescue DAG number is incremented.  When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).
 
 {subsection: Format of rescue DAG}
 
@@ -13,5 +13,24 @@
 Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.
 
 {subsection: Creation of rescue DAG}
+
+Rescue DAGs are created in three cases:
+
+*: The DAGMan job itself is condor_rm'ed.
+*: One or more nodes have failed, and the DAG has reached the point where it can make no more forward progress.
+*: Immediately after parsing (or attempting to parse) the DAG, if =-DumpRescue= is given on the command line.
+
+In the first two cases, =main_shutdown_rescue()= in _dagman_main.cpp_ gets called, and it in turn calls =Dag::Rescue()=.  If DAGMan is condor_rm'ed, =main_shutdown_rescue()= is called by =main_shutdown_remove()=, which is called by DaemonCore; otherwise, it is called explicitly at various places in the DAGMan code.  Note that =main_shutdown_rescue()= uses the =inShutdownRescue= flag to make sure you don't get into a recursion if an error occurs while trying to write the rescue DAG.
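
The recursion guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual DAGMan code: only the flag name =inShutdownRescue= comes from the text, while =shutdownRescue()= and its callback parameter are hypothetical stand-ins.

```cpp
#include <cassert>
#include <functional>

// Flag name taken from the text; everything else here is hypothetical.
static bool inShutdownRescue = false;

// Runs writeRescue at most once. Returns false if a rescue write is
// already in progress, so an error path that re-enters this function
// cannot recurse.
bool shutdownRescue(const std::function<void()>& writeRescue) {
    assert(writeRescue && "a rescue-writing callback must be supplied");
    if (inShutdownRescue) {
        return false;  // already shutting down; do not try again
    }
    inShutdownRescue = true;
    writeRescue();
    return true;
}
```

The flag is deliberately never reset in this sketch, on the assumption that the shutdown path ends the process.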
+
+In the third case, =Dag::Rescue()= is called at various places in =main_init()= in _dagman_main.cpp_ (man, that function is too big!!).
+
+In all three cases, =Dag::Rescue()= calls three functions in turn: =FindLastRescueDagNum()=, which checks all legal rescue DAG names and finds the highest-numbered one that exists; =RescueDagName()=, which creates a properly formatted rescue DAG file name; and =WriteRescue()=, which actually writes the rescue DAG.
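
As a rough sketch of the first two steps, the code below probes every legal name for the highest existing number and then formats the next name. The helpers =rescueName()=, =findLastRescueNum()= and =nextRescueName()=, the =maxNum= limit, and the =exists= callback are all hypothetical; they only mirror what the text says =FindLastRescueDagNum()= and =RescueDagName()= do, and the three-digit padding is inferred from the _foo.dag.rescue.001_ example.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical helper: build "<dagFile>.rescue.NNN" with a zero-padded number.
std::string rescueName(const std::string& dagFile, int num) {
    assert(num >= 1);
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), ".rescue.%03d", num);
    return dagFile + suffix;
}

// Hypothetical helper: check every legal name from 1 to maxNum and return
// the highest number whose file exists, or 0 if there is none.
int findLastRescueNum(const std::string& dagFile, int maxNum,
                      const std::function<bool(const std::string&)>& exists) {
    int last = 0;
    for (int n = 1; n <= maxNum; ++n) {
        if (exists(rescueName(dagFile, n))) {
            last = n;
        }
    }
    return last;
}

// The next rescue DAG to write is one past the highest existing number.
std::string nextRescueName(const std::string& dagFile, int maxNum,
                           const std::function<bool(const std::string&)>& exists) {
    return rescueName(dagFile, findLastRescueNum(dagFile, maxNum, exists) + 1);
}
```

Passing the existence check in as a callback keeps the sketch independent of any particular file system API.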
+
+I think =WriteRescue()= is pretty straightforward.  The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true.  =WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one.  The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation and make the code easier to read.  Note that the =isPartial= flag is passed on to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG.
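
The split between the two functions might look roughly like this. It is a simplified sketch under stated assumptions: the =Node= struct, the lower-case function names, and the exact line formats (=JOB=, =DONE=, =RETRY=) are illustrative and not taken from the DAGMan source. The only behavior carried over from the text is that a partial rescue DAG contains just the done markers and retry resets.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, pared-down node record.
struct Node {
    std::string name;
    std::string submitFile;
    bool done;
    int retriesLeft;  // 0 means no retry line is needed
};

// Writes one node. Most of the output is skipped for a partial rescue DAG.
void writeNodeToRescue(std::ostream& out, const Node& node, bool isPartial) {
    assert(!node.name.empty());
    if (!isPartial) {
        // Full node definition (assumed format), omitted when partial.
        out << "JOB " << node.name << " " << node.submitFile << "\n";
    }
    if (node.done) {
        out << "DONE " << node.name << "\n";
    }
    if (node.retriesLeft > 0) {
        out << "RETRY " << node.name << " " << node.retriesLeft << "\n";
    }
}

// Iterates over all nodes and delegates to writeNodeToRescue().
std::string writeRescue(const std::vector<Node>& nodes, bool isPartial) {
    std::ostringstream out;
    out << "# Rescue DAG (sketch)\n";
    for (const Node& node : nodes) {
        writeNodeToRescue(out, node, isPartial);
    }
    return out.str();
}
```

Keeping the per-node logic in its own function means the =isPartial= checks stay at one indentation level.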
+
+
+
 {subsection: Use/parsing of rescue DAG}
 {subsection: Additional actions on condor_rm of DAGMan}
+remove jobs
+final node