-(In progress.) +A question that is often asked about DAGs is this: how can I make +two or more DAG node jobs run on the same machine? (Typically this +is because the first job generates large temporary files that are +to be used by the second job, and it is desirable to avoid having +to transfer those files between machines.) + +This document presents a way to ensure that consecutive DAG nodes +run on the same machine, without pre-selecting a specific machine. + +(Note: this document refers to a jobs for a parent and a child node +running on the same machine; but the scheme can be easily extended +to any number of descendants.) + +*The basic scheme*: + +The parent node has a POST script that determines which machine the +node job ran on; this script then outputs a file containing the +machine name, and that file is incorporated into the submit file of +the child node job (and any subsequent jobs that should run on that +machine) in such a way that the child job is required to run on +that machine. + +This scheme takes advantage of claim leases (see +http://research.cs.wisc.edu/htcondor/manual/v8.5/3_5Policy_Configuration.html#35520), +which DAGMan automatically sets to last for 20 seconds after the end +of a node job, if the node has children. The claim lease allows a +second job from the same user to run on the execute machine without +having to go through a negotiation cycle. + +(Note that the duration of the claim lease can be changed by changing +the setting of the DAGMAN_HOLD_CLAIM_TIME configuration macro, as +in the example below.) + +*Caveats*: + +*: It is always possible for another job submitted by the same user +to "steal" the machine before the child job runs on it. (In other +words, the child job is required to run on the designated machine, +but that machine is not required to run only the child job.) +(The job that does the "stealing" could be an entirely separate +job submitted by the same user, or another job from the same +DAG, if that job is not assigned to a specific machine.) + +*: (If DAGMAN_HOLD_CLAIM_TIME is sufficiently long, it is not possible +for a job from a different user to "steal" the machine.) + +*: For this scheme to work, the child job must have requirements +that will match whatever machine the parent job runs on. (Ideally, +the two jobs should have identical requirements.) If the child +job's requirements do *not* match the machine the parent ran on +(for example, the child job requires more memory), the child job +will never run, and, therefore, the DAG will never complete. + +*: Also, if the relevant machine encounters some problem that prevents +the child job from running, the DAG will never complete, because the +child job is not allowed to be run by any other machine. + +*: The jobs to be run on the same machine should belong to nodes +that have an immediate parent/child relationship in the DAG -- +otherwise the claim lease is likely to expire before the child +job can run, if the machine is not "stolen" by another job. + +*: The submit file include feature that this example uses was +only added in version 8.3.5. If you are running an earlier +version of HTCondor, you can use a similar scheme, but the parent +node POST script would have to write out the entire submit file +for the child node job. + +*Example*: + +(This is a shortened version of the example in the attached tar file.) + +# File: example.dag + + config example.config + + job A0 nodeA.sub + vars A0 node="$(JOB)" + script post A0 post.pl B0 $JOBID + + job B0 nodeB0.sub + vars B0 node="$(JOB)" + + parent A0 child B0 + +# File: example.config + + DAGMAN_HOLD_CLAIM_TIME = 60 + +# File: nodeA.sub + + executable = /bin/hostname + output = $(node).out + queue + +# File: nodeB0.sub + + executable = /bin/hostname + output = $(node).out + # Unfortunately, we can't use $(node) in the include file name. + include : B0.inc + requirements = TARGET.Name == $(my_machine) + queue + +# File: post.pl + + #!/usr/bin/env perl + + $outfile = $ARGV[0] . ".inc"; + + if (-e $outfile) { + system("rm -f $outfile"); + } + + open(OUT, ">$outfile") or die "Couldn't open output file $outfile: $!"; + $host = `condor_history $ARGV[1] -af LastRemoteHost -limit 1`; + chomp $host; + print OUT "my_machine = \"$host\"\n"; + close(OUT); + +On *nix systems, you should be able to copy and paste the example +above, or download the attached example tar file, and run the +example DAG without modification. On Windows you will have to +make some small changes to the example for it to work.