 {subsection: Naming of rescue DAG files}
-If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default). If there are any gaps in the numbering sequence (e.g., the user deletes an intermediate file) the system will be goofed up.
+If the original DAG file is _foo.dag_, the rescue DAG files will be named _foo.dag.rescue.001_, _foo.dag.rescue.002_, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default).
 {subsection: Format of rescue DAG}
@@ -13,5 +13,24 @@
 Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.
 {subsection: Creation of rescue DAG}
+
+Rescue DAGs are created in three cases:
+
+*: The DAGMan job itself is condor_rm'ed
+*: A node or nodes have failed, and the DAG has reached the point where it can make no more forward progress.
+*: Immediately after parsing (or attempting to parse) the DAG, if =-DumpRescue= is given on the command line.
+
+In the first two cases, =main_shutdown_rescue()= in _dagman_main.cpp_ gets called (if DAGMan is condor_rm'ed, =main_shutdown_rescue()= is called by =main_shutdown_remove()=, which is called by daemoncore; otherwise, it's explicitly called at various places in the DAGMan code), and that calls =Dag::Rescue()=. Note that =main_shutdown_rescue()= uses the =inShutdownRescue= flag to make sure you don't get into a recursion if you have an error while trying to write the rescue DAG.
+
+In the third case, =Dag::Rescue()= is called at various places in =main_init()= in _dagman_main.cpp_ (man, that function is too big!!).
+
+In any case, =Dag::Rescue()= calls =FindLastRescueDagNum()=, which checks all legal rescue DAG names and finds the highest-numbered one, and then =RescueDagName()=, which creates a properly-formatted rescue DAG file name, and then =WriteRescue()=, which actually writes the rescue DAG.
+
+I think =WriteRescue()= is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true. =WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one. The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation, and make the code easier to read. Note that this =isPartial= flag is passed to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG.
+
+
+
 {subsection: Use/parsing of rescue DAG}
 {subsection: Additional actions on condor_rm of DAGMan}
+remove jobs
+final node
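
As a rough illustration of the naming scheme and of the =FindLastRescueDagNum()= / =RescueDagName()= steps described above, here is a minimal standalone C++ sketch. It is not the DAGMan code: the helper names (rescueDagName, findLastRescueDagNum, nextRescueDagName) and their signatures are hypothetical, a plain list of file names stands in for whatever check DAGMan actually does, and the three-digit suffix is inferred from the _foo.dag.rescue.001_ example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: build "<dagFile>.rescue.NNN" with a zero-padded number.
// The three-digit width is an assumption taken from the foo.dag.rescue.001 example.
std::string rescueDagName(const std::string& dagFile, int num) {
    assert(num >= 1);  // rescue DAG numbers start at 1 in this sketch
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), ".rescue.%03d", num);
    return dagFile + suffix;
}

// Hypothetical helper: return the highest rescue DAG number found among
// 'existing' for 'dagFile', or 0 if there is none. A vector of names stands
// in for the file system here (an assumption made to keep the sketch testable).
int findLastRescueDagNum(const std::string& dagFile,
                         const std::vector<std::string>& existing) {
    const std::string prefix = dagFile + ".rescue.";
    int last = 0;
    for (const std::string& name : existing) {
        // Only accept "<prefix>" followed by exactly three digits.
        if (name.size() != prefix.size() + 3) continue;
        if (name.compare(0, prefix.size(), prefix) != 0) continue;
        int num = 0;
        bool allDigits = true;
        for (std::size_t i = prefix.size(); i < name.size(); ++i) {
            if (name[i] < '0' || name[i] > '9') { allDigits = false; break; }
            num = num * 10 + (name[i] - '0');
        }
        if (allDigits && num > last) last = num;
    }
    return last;
}

// Hypothetical helper: name of the next rescue DAG to write, i.e. the
// highest existing number plus one.
std::string nextRescueDagName(const std::string& dagFile,
                              const std::vector<std::string>& existing) {
    return rescueDagName(dagFile, findLastRescueDagNum(dagFile, existing) + 1);
}
```

For example, given _foo.dag.rescue.001_ and _foo.dag.rescue.003_, the sketch reports 3 as the last rescue DAG number and picks _foo.dag.rescue.004_ as the next name to write.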