HTCondorWiki: Rescue Dag


Documentation of the rescue DAG/condor_rm and related code in DAGMan.

Note that rescue DAGs have gone through a number of significant changes over the history of DAGMan. I'm only going to document the most recent version here.

Naming of rescue DAG files

If the original DAG file is foo.dag, the rescue DAG files will be named foo.dag.rescue.001, foo.dag.rescue.002, etc. Each time a rescue DAG is written, the rescue DAG number is incremented. When a rescue DAG is run, the highest-numbered rescue DAG is used (by default). If there are any gaps in the numbering sequence (e.g., the user deletes an intermediate file), rescue DAG handling will not work properly.
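The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not the DAGMan code (which is C++): the function names are made up for this example, and the sketch works on a plain list of file names rather than looking at the file system.

```python
import re
from typing import List, Optional


def rescue_dag_name(dag_file: str, number: int) -> str:
    """Build a rescue DAG file name, e.g. foo.dag -> foo.dag.rescue.001."""
    return f"{dag_file}.rescue.{number:03d}"


def rescue_numbers(dag_file: str, filenames: List[str]) -> List[int]:
    """Return the sorted rescue DAG numbers found for dag_file in filenames."""
    # Hypothetical matcher: the DAG file name, ".rescue.", then 3+ digits.
    pattern = re.compile(re.escape(dag_file) + r"\.rescue\.(\d{3,})")
    numbers = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def latest_rescue_number(dag_file: str, filenames: List[str]) -> Optional[int]:
    """Highest rescue DAG number (the one run by default), or None."""
    numbers = rescue_numbers(dag_file, filenames)
    return numbers[-1] if numbers else None


def has_gaps(numbers: List[int]) -> bool:
    """True if the sorted numbers are not exactly 1, 2, ..., n."""
    return numbers != list(range(1, len(numbers) + 1))
```

For example, given the files foo.dag.rescue.001 and foo.dag.rescue.003, `latest_rescue_number` picks 3 and `has_gaps` reports the missing 002, which is the situation the paragraph above warns about.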

Format of rescue DAG

As of version 7.7.2, a rescue DAG file is no longer a complete DAG file, as it used to be. Now, the rescue DAG only records the state of the DAG (which nodes are done, and remaining node retries) and must be parsed in conjunction with the original DAG file. This is done so that if a user discovers an error in their DAG file when they have a rescue DAG, they only need to fix the original DAG file, rather than having to fix both the original file and the rescue DAG file. (The old behavior can be obtained by setting DAGMAN_WRITE_PARTIAL_RESCUE to false.)

Anyhow, in the default case, the only things (besides comments) in the rescue DAG file are lines marking nodes as done and lines resetting the remaining retries for nodes.
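As a rough illustration of how little state such a file carries, here is a Python sketch that reads the two kinds of lines. The exact line syntax is an assumption made for this example (a `DONE <node>` line to mark a node as done, a `RETRY <node> <count>` line to reset remaining retries, and `#` for comments); check a real rescue DAG file before relying on it. The function name is hypothetical, and this is not how DAGMan itself parses the file.

```python
from typing import Dict, Set, Tuple


def parse_rescue_state(text: str) -> Tuple[Set[str], Dict[str, int]]:
    """Collect done nodes and remaining retries from rescue DAG text.

    Assumed line forms (not taken from the DAGMan source):
        DONE <node>
        RETRY <node> <count>
        # comment
    """
    done: Set[str] = set()
    retries: Dict[str, int] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        keyword = fields[0].upper()
        if keyword == "DONE" and len(fields) == 2:
            done.add(fields[1])
        elif keyword == "RETRY" and len(fields) == 3:
            retries[fields[1]] = int(fields[2])
        else:
            raise ValueError(f"unexpected rescue DAG line: {raw_line!r}")
    return done, retries
```

The result would then be applied on top of the nodes defined by the original DAG file, which is why the rescue DAG cannot be parsed on its own.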

