 Documentation of the rescue DAG/condor_rm and related code in DAGMan.

-Note that rescue DAGs have done through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.
+Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

 {subsection: Naming of rescue DAG files}

@@ -26,7 +26,7 @@

 In any case, =Dag::Rescue()= calls =FindLastRescueDagNum()=, which checks all legal rescue DAG names and finds the highest-numbered one, and then =RescueDagName()=, which creates a properly-formatted rescue DAG file name, and then =WriteRescue()=, which actually writes the rescue DAG.

-I think =WriteRescue()= is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true. =WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one. The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation, and make the code easier to read. Note that this =isPartial= flag is passed to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG.
+I think =WriteRescue()= is pretty straightforward. The main complication is that there are a bunch of places where we don't write stuff out if =isPartial= is true. =WriteRescue()= iterates through all of the nodes and calls =WriteNodeToRescue()= on each one. The main reason for breaking =WriteNodeToRescue()= out of =WriteRescue()= was to reduce the excessive indentation, and make the code easier to read. Note that this =isPartial= flag is passed to =WriteNodeToRescue()=, because most of the info printed by =WriteNodeToRescue()= is _not_ printed to a partial rescue DAG (which is the default mode).
 {subsection: Use/parsing of rescue DAG}

@@ -35,5 +35,11 @@
 After parsing the original DAG file(s), DAGMan then parses the rescue DAG (which just sets the status of various nodes to DONE as appropriate, and also possibly changes the number of retries left on some nodes). At that point, we're ready to actually run the DAG.

 {subsection: Additional actions on condor_rm of DAGMan}
-remove jobs
-final node
+
+In =main_shutdown_rescue()=, if there are any node jobs in the queue, we call =Dag::RemoveRunningJobs()= to remove them. If any PRE or POST scripts are running, we call =RemoveRunningScripts()= to kill them. (Note that in the case of DAGMan having made all the progress it can in the face of node failures, there won't be any node jobs or scripts running at this point.)
+
+=Dag::RemoveRunningJobs()= removes any Condor jobs by using the constraint _"DAGManJobId == <id>"_; any Stork jobs must be removed individually.
+
+=Dag::RemoveRunningScripts()= iterates through all of the nodes, and individually kills any running scripts via DaemonCore.
+
+In =main_shutdown_rescue()= we also call =Dag::StartFinalNode()=, which starts the DAG final node if there is one and we haven't already started it.
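
The shutdown actions added in the last hunk can be sketched as a small toy model. This is not the DAGMan code: the names =RemoveRunningJobs()=, =RemoveRunningScripts()= and =StartFinalNode()= come from the text above, but =ToyDag=, =ToyNode=, =toy_main_shutdown_rescue()=, every signature, and the order of the three calls are assumptions made for illustration. Nothing is actually removed or killed; the sketch only records what it would do, and it leaves out Stork jobs and the writing of the rescue DAG itself.

```cpp
#include <string>
#include <vector>

// Toy model only: method names follow the text, but all signatures, fields,
// and the call order are assumptions, not the real DAGMan implementation.
struct ToyNode {
    std::string name;
    bool jobQueued = false;      // a node job is currently in the queue
    bool scriptRunning = false;  // a PRE or POST script is currently running
};

struct ToyDag {
    std::vector<ToyNode> nodes;
    bool hasFinalNode = false;
    bool finalNodeStarted = false;
    std::vector<std::string> actions;  // record of what would have been done

    bool AnyJobsQueued() const {
        for (const ToyNode& n : nodes) {
            if (n.jobQueued) return true;
        }
        return false;
    }

    bool AnyScriptsRunning() const {
        for (const ToyNode& n : nodes) {
            if (n.scriptRunning) return true;
        }
        return false;
    }

    // One bulk removal for all node jobs, selected by a single constraint.
    void RemoveRunningJobs(int dagmanJobId) {
        actions.push_back("remove jobs matching DAGManJobId == " +
                          std::to_string(dagmanJobId));
        for (ToyNode& n : nodes) n.jobQueued = false;
    }

    // Scripts are handled one node at a time.
    void RemoveRunningScripts() {
        for (ToyNode& n : nodes) {
            if (n.scriptRunning) {
                actions.push_back("kill script of node " + n.name);
                n.scriptRunning = false;
            }
        }
    }

    // Starts the final node only if one exists and it was not started before.
    bool StartFinalNode() {
        if (!hasFinalNode || finalNodeStarted) return false;
        finalNodeStarted = true;
        actions.push_back("start final node");
        return true;
    }
};

// Hypothetical stand-in for the shutdown path described above.
void toy_main_shutdown_rescue(ToyDag& dag, int dagmanJobId) {
    if (dag.AnyJobsQueued()) dag.RemoveRunningJobs(dagmanJobId);
    if (dag.AnyScriptsRunning()) dag.RemoveRunningScripts();
    dag.StartFinalNode();
}
```

The point of the sketch is the asymmetry the text describes: node jobs go away in one constraint-based removal, while scripts are killed node by node, and the final node is started at most once no matter how often the shutdown path runs.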