Ticket #2165: Rescue DAGs should not be complete DAG files?

In today's discussion with the botany people, we talked about why they don't use rescue DAGs more. At least one reason is that when they fix things up after a failure, they often want to make slight changes to the DAG file itself (for example, renaming the DAG file of a SUBDAG EXTERNAL node). There is no clean way to do this: if you edit the original DAG file, the change won't take effect when the rescue DAG is run; and if you edit the rescue DAG, you lose the change if the original DAG is re-run from scratch.

So I think the solution is that the rescue DAG file should no longer be an actual DAG file; rather, it should just list which nodes are marked as done. Then, when you run a rescue DAG, the original DAG file would be parsed, the appropriate (latest) rescue DAG file would be read, and the nodes it lists would be marked as done.
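
As a rough illustration of that flow (a sketch only, not DAGMan's actual code), the Python below picks the highest-numbered rescue file for a DAG and marks the nodes it lists as done. The `<dag file>.rescueNNN` naming pattern and the helper names `latest_rescue` and `mark_done` are assumptions made for this example.

```python
import re


def latest_rescue(dag_file, candidates):
    """Return the highest-numbered rescue file for dag_file, or None.

    Assumes rescue files are named '<dag_file>.rescue<digits>'.
    """
    pattern = re.compile(re.escape(dag_file) + r"\.rescue(\d+)")
    best, best_num = None, -1
    for name in candidates:
        m = pattern.fullmatch(name)
        if m and int(m.group(1)) > best_num:
            best, best_num = name, int(m.group(1))
    return best


def mark_done(nodes, done_names):
    """Return a copy of nodes (name -> done flag) with done_names marked done.

    'nodes' stands in for the result of parsing the original DAG file.
    """
    return {name: (done or name in done_names) for name, done in nodes.items()}


# Hypothetical example: three nodes parsed from the original DAG file,
# two of which are listed as done in the latest rescue file.
files = ["my.dag.rescue001", "my.dag.rescue002", "other.dag.rescue009"]
print(latest_rescue("my.dag", files))
print(mark_done({"A": False, "B": False, "C": False}, {"A", "B"}))
```

Because the original DAG file is re-parsed on every run, edits to it take effect in the rescue run, which is the point of the proposal.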

Note that if we do this, it would probably mean eliminating the -oldrescue option that we currently support.



2011-May-20 09:28:16 by wenger:
Oh, yeah -- the rescue DAG would also have to contain updated retry counts. Maybe it should have the number of retries already used, rather than the number of retries left?

2011-May-20 09:40:13 by nwp:
Doesn't it keep the state more localized if we just write out the number of retries left? If you count the ones already used, you have to do some accounting to find the number left, whereas if you write the number of retries left into the rescue DAG, you don't need that bookkeeping.

2011-May-20 10:04:33 by wenger:
Well, the main reason I was thinking of writing out the number of retries used is in case the user changes the retry count in the "main" DAG file before re-running the DAG. But I think this is a fairly minor consideration -- the main thing is that the "new" rescue DAG would have to have something about retries, as well as which nodes are done.
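
To make the difference between the two options concrete, here is a small sketch. The function names and the numbers are hypothetical and only illustrate the arithmetic; they are not taken from the DAGMan code.

```python
def left_if_storing_used(current_max_retries, retries_used):
    """Rescue file records retries already used; the number left is
    recomputed from whatever retry count the main DAG file has now."""
    return max(current_max_retries - retries_used, 0)


def left_if_storing_left(retries_left_at_rescue):
    """Rescue file records retries left; later edits to the retry count
    in the main DAG file have no effect on this node."""
    return retries_left_at_rescue


# A node had a retry count of 3 and used 2 retries before the rescue
# file was written, so 1 retry was left at that point.
original_max, used = 3, 2
left = original_max - used

# The user then raises the retry count to 5 in the main DAG file.
new_max = 5
print(left_if_storing_used(new_max, used))   # 3
print(left_if_storing_left(left))            # 1
```

If the retry count in the main DAG file is never edited, both approaches give the same answer, which is why this is a fairly minor consideration.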

2011-Jun-10 14:15:42 by wenger:
Created V7_7-new_rescue_dag-branch for this.

2011-Aug-24 09:37:29 by wenger:
Okay, I think the code for this is all set (on the V7_7-new_rescue_dag-branch branch). I'm going to assign it over to Nathan for review. Nathan -- if you think the code is okay, you should set the status to "documentation", not "resolved", because I don't have the documentation done yet.

2011-Aug-24 11:18:29 by nwp:
Reviewed patch. Everything looks good, except for some stylistic quibbles idiosyncratic to the reviewer.

2011-Aug-25 11:04:02 by wenger:
I fixed the things Nathan pointed out in his review, and merged this to the master (after fixing some really nasty conflicts where git did a bad job with the merge).

This still needs documentation...

2011-Aug-31 13:16:45 by wenger:
Okay, I've finished the documentation (pending any fixes from Karen) so I am going to resolve this ticket.


Type:           enhance               Last Change:    2011-Aug-31 13:17
Status:         resolved              Created:        2011-May-17 16:51
Fixed Version:  v070702               Broken Version: v070600
Priority:                             Subsystem:      Dag
Assigned To:    wenger                Derived From:
Creator:        wenger                Rust:
Customer Group: chtc                  Visibility:     public
Notify:         wenger@cs.wisc.edu, psilord@cs.wisc.edu
Due Date:       20110826

Related Check-ins:

2011-Sep-02 09:35   Check-in [27079]: Last documentation detail of #2165: changes to Rescue DAG implementation; removed no-longer-present command line arguments from condor_dagman and condor_submit_dag man pages. ===GT=== #2165 (By Karen Miller )
2011-Sep-01 11:31   Check-in [27054]: edits of new partial Rescue DAG implementation documentation ===GT=== #2165 (By Karen Miller )
2011-Aug-31 13:09   Check-in [27043]: Gittrac #2165: documented this (partial rescue DAG files). Also added documentation for the DAGMAN_RESET_RETRIES_UPON_RESCUE configuration variable, which was missing for some reason. Fixed some errors in documentation of the -debug option to condor_dagman and condor_submit_dag. (By Kent Wenger )
2011-Aug-31 09:15   Check-in [27037]: Gittrac #2165: the version information in the rescue DAG now includes whether the DAG is a full or partial DAG file. (By Kent Wenger )
2011-Aug-30 14:50   Check-in [27032]: Remove these tests from master. The always-run-post-script code in #2057 has not been merged yet. This was changed in work for #2165 also. (By Nathan W. Panike )
2011-Aug-30 14:10   Check-in [27028]: Gittrac #2165: added the DAGMAN_WRITE_PARTIAL_RESCUE config macro, which allows me to fix job_dagman_pre_skip-C (which was job_dagman_pre_skip-D) and job_dagman_retry (re-enabled job_dagman_retry as a result). (By Kent Wenger )
2011-Aug-25 10:29   Check-in [26974]: Merged [22152], [22153], [22155], [22159], [22160], [22165], [22166], [22167], [22219], [22231], [22368], [22371], [26716], [26734], [26735], [26742], [26924], [26925], [26938], Merge branch 'V7_7-new_rescue_dag-branch' [...] (By Kent Wenger )
2011-Aug-24 13:29   Check-in [26938]: Gittrac #2165: slight changes based on review by Nathan. (By Kent Wenger )
2011-Aug-24 09:26   Check-in [26925]: Gittrac #2165: DAGMan now passes all tests (except for job_dagman_retry, and I can't even figure out what that's supposed to be doing, except that it relies on old-style rescue DAGs, so I'm disabling it). (By Kent Wenger )
2011-Aug-10 17:25   Check-in [26734]: Gittrac #2165: partly done getting things to work with splices. (By Kent Wenger )
2011-Aug-10 15:10   Check-in [26716]: Gittrac #2165: DONE line for a nonexistent node is now a warning, in case the user removes nodes from the original DAG; extra tokens on a DONE line are now an error; manually tested -DumpRescue and rescue DAG with multiple DAGs. (By Kent Wenger )
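
The DONE-line handling described in check-in [26716] could look roughly like the sketch below. This is an illustration under stated assumptions, not the DAGMan parser: the function name `read_done_lines`, the `#` comment convention, and the treatment of a DONE line with no node name as an error are all assumptions; only the "unknown node is a warning, extra tokens are an error" behavior comes from the check-in message.

```python
def read_done_lines(lines, known_nodes):
    """Collect node names from DONE lines of a partial rescue file.

    Returns (done, warnings): the set of known nodes marked done, and a
    list of warning strings for DONE lines naming unknown nodes.
    Raises ValueError for a malformed DONE line.
    """
    done, warnings = set(), []
    for lineno, raw in enumerate(lines, start=1):
        tokens = raw.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment (assumed syntax)
        if tokens[0] != "DONE":
            continue  # other line types are ignored in this sketch
        if len(tokens) != 2:
            # Extra tokens are an error; a missing name is treated the
            # same way here (an assumption of this sketch).
            raise ValueError(f"line {lineno}: expected 'DONE <node>'")
        name = tokens[1]
        if name not in known_nodes:
            # The user may have removed the node from the original DAG.
            warnings.append(f"line {lineno}: unknown node {name!r}")
            continue
        done.add(name)
    return done, warnings


done, warns = read_done_lines(["# rescue", "DONE A", "DONE Z"], {"A", "B"})
print(sorted(done), warns)
```

Treating an unknown node as a warning rather than an error lets a user delete nodes from the original DAG file and still reuse an existing rescue file.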