In today's discussion with the botany people, one of the things we talked about was why they don't use rescue DAGs more. It turns out that at least one reason is that when they fix things up after a failure, they often want to make slight changes to the DAG file itself (renaming the DAG file of a subdag external, for example). But there's no clean way to do this: if you edit the original DAG file, the change won't take effect when the rescue DAG is run; and if you edit the rescue DAG, you lose the change if the original DAG is re-run from scratch.

I think the solution is that the rescue DAG file should no longer be an actual DAG file; rather, it should just be a list of which nodes are marked as done. Then, when you run a rescue DAG, the original DAG file would be parsed, the appropriate (latest) rescue DAG file would be read, and the nodes listed there would be marked as done.
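
A rough sketch of that flow, for illustration only. The `DONE <node>` line format, the `.rescueNNN` file suffix, and all function names here are assumptions made for the example, not something this ticket specifies, and the DAG parsing is reduced to `JOB` lines:

```python
import re

def parse_dag(dag_text):
    """Parse JOB lines from a (heavily simplified) DAG file into node records."""
    nodes = {}
    for line in dag_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0].upper() == "JOB":
            nodes[parts[1]] = {"submit": parts[2], "done": False}
    return nodes

def latest_rescue(file_names, dag_name):
    """Return the highest-numbered rescue file for dag_name, or None.

    Assumes rescue files are named '<dag_name>.rescueNNN' (hypothetical here).
    """
    pattern = re.compile(re.escape(dag_name) + r"\.rescue(\d+)$")
    best = None
    for name in file_names:
        match = pattern.match(name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), name)
    return best[1] if best else None

def apply_rescue(nodes, rescue_text):
    """Mark nodes as done according to 'DONE <node>' lines (assumed format)."""
    for line in rescue_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].upper() == "DONE" and parts[1] in nodes:
            nodes[parts[1]]["done"] = True
    return nodes

# The original DAG file is always the one that gets parsed, so an edit to it
# (here, a changed submit file for B) is picked up on the rescue run.
edited_dag = "JOB A a.sub\nJOB B b_fixed.sub\nPARENT A CHILD B\n"
nodes = apply_rescue(parse_dag(edited_dag), "DONE A\n")
print(nodes["A"]["done"], nodes["B"]["done"], nodes["B"]["submit"])
```

Because the rescue file carries only state, the same edited DAG file also works unchanged if the DAG is later re-run from scratch.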

Note that if we do this, it would probably mean eliminating the -oldrescue option that we currently support.

2011-May-20 09:28:16 by wenger:
Oh, yeah -- the rescue DAG would also have to contain updated retry counts. Maybe it should have the number of retries already used, rather than the number of retries left?

2011-May-20 09:40:13 by nwp:
Doesn't it keep the state more localized if we just write out the number of retries left? If you count the ones already used, then you have to do some accounting to find the number left, whereas if you write the number of retries left into the rescue DAG, you don't need to do the same bookkeeping.

2011-May-20 10:04:33 by wenger:
Well, the main reason I was thinking of writing out the number of retries used is in case the user changes the retry count in the "main" DAG file before re-running the DAG. But I think this is a fairly minor consideration -- the main thing is that the "new" rescue DAG would have to have something about retries, as well as which nodes are done.
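
To make the trade-off concrete, here is a small arithmetic sketch. The function names and the numbers are hypothetical, chosen only to show how the two recording schemes diverge when the retry count in the main DAG file is edited between runs:

```python
def left_from_used(current_max_retries, retries_used):
    """Retries remaining if the rescue file records retries already used.

    Needs the (possibly edited) retry count from the main DAG file.
    """
    return max(0, current_max_retries - retries_used)

def left_from_left(retries_left_recorded):
    """Retries remaining if the rescue file records retries left directly."""
    return retries_left_recorded

# A node started with a retry count of 3 and used 2 before the rescue file
# was written.
original_max, used = 3, 2
recorded_left = original_max - used

# The user then raises the retry count to 5 in the main DAG file.
new_max = 5
print(left_from_used(new_max, used))   # 3: the edit is honored
print(left_from_left(recorded_left))   # 1: the edit is ignored
```

Recording retries left needs no bookkeeping at rescue time, while recording retries used costs one subtraction but follows changes to the main DAG file.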

2011-Jun-10 14:15:42 by wenger:
Created V7_7-new_rescue_dag-branch for this.

2011-Aug-24 09:37:29 by wenger:
Okay, I think the code for this is all set (on the V7_7-new_rescue_dag-branch branch). I'm going to assign it over to Nathan for review. Nathan -- if you think the code is okay, you should set the status to "documentation", not "resolved", because I don't have the documentation done yet.

2011-Aug-24 11:18:29 by nwp:
Reviewed patch. Everything looks good, except for some stylistic quibbles idiosyncratic to the reviewer.

2011-Aug-25 11:04:02 by wenger:
I fixed the things Nathan pointed out in his review, and merged this into master (after fixing some really nasty conflicts where git did a bad job with the merge).

This still needs documentation...

2011-Aug-31 13:16:45 by wenger:
Okay, I've finished the documentation (pending any fixes from Karen) so I am going to resolve this ticket.
