A question that is often asked about DAGs is: how can I make two or more DAG node jobs run on the same machine? (Typically this is because the first job generates large temporary files that the second job consumes, and it is desirable to avoid transferring those files between machines.)

This document presents a way to ensure that consecutive DAG nodes run on the same machine, without pre-selecting a specific machine.

(Note: this document refers to the jobs for a parent and a child node running on the same machine, but the scheme can easily be extended to any number of descendants.)

The basic scheme:

The parent node has a POST script that determines which machine the node job ran on. The script then writes a file containing the machine name, and that file is incorporated into the submit file of the child node job (and of any subsequent jobs that should run on that machine) in such a way that the child job is required to run on that machine.
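For instance, the file written by the POST script in the example below (B0.inc) contains a single submit-file macro definition; the slot and host names shown here are purely illustrative:

  my_machine = "slot1@execute1.example.com"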

This scheme takes advantage of claim leases (see http://research.cs.wisc.edu/htcondor/manual/v8.5/3_5Policy_Configuration.html#35520), which DAGMan automatically sets to last for 20 seconds after the end of a node job, if the node has children. The claim lease allows a second job from the same user to run on the execute machine without having to go through a negotiation cycle.

(Note that the duration of the claim lease can be adjusted via the DAGMAN_HOLD_CLAIM_TIME configuration macro, as in the example below.)
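To confirm the value in effect for your installation, you can query the configuration with condor_config_val (the output shown here is the default):

  condor_config_val DAGMAN_HOLD_CLAIM_TIME
  20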

Caveats:

Because the child job is pinned to a single slot, it cannot run anywhere else; if that machine becomes unavailable, the child will sit idle. Also, if the child does not start before the claim lease expires, the claim is released, and the child must wait for a normal negotiation cycle to be matched to the recorded machine again.

Example:

(This is a shortened version of the example in the attached tar file.)

# File: example.dag

  config example.config

  job A0 nodeA.sub
  vars A0 node="$(JOB)"
  # post.pl writes B0.inc for the child node; $JOBID expands to the
  # HTCondor job ID (cluster.proc) of node A0's job.
  script post A0 post.pl B0 $JOBID

  job B0 nodeB0.sub
  vars B0 node="$(JOB)"

  parent A0 child B0

# File: example.config

  DAGMAN_HOLD_CLAIM_TIME = 60

# File: nodeA.sub

  executable = /bin/hostname
  output = $(node).out
  queue

# File: nodeB0.sub

  executable = /bin/hostname
  output = $(node).out
  # Unfortunately, we can't use $(node) in the include file name.
  include : B0.inc
  requirements = TARGET.Name == $(my_machine)
  queue
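Once B0.inc has been included, the requirements line above expands to something like the following (hostname hypothetical), restricting node B0's job to the slot that ran node A0's job:

  requirements = TARGET.Name == "slot1@execute1.example.com"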

# File: post.pl

  #!/usr/bin/env perl

  use strict;
  use warnings;

  # Usage: post.pl <child node name> <parent node job ID>
  my $outfile = $ARGV[0] . ".inc";

  # Remove any include file left over from a previous run.
  unlink $outfile if -e $outfile;

  # Ask condor_history which slot the parent node's job ran on.
  my $host = `condor_history $ARGV[1] -af LastRemoteHost -limit 1`;
  chomp $host;

  # Write an include file that pins the child job to that slot.
  open(my $out, '>', $outfile) or die "Couldn't open output file $outfile: $!";
  print $out "my_machine = \"$host\"\n";
  close($out);
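For reference, LastRemoteHost normally has the form slot@hostname, which matches the form of a slot's Name attribute, so the value can be compared directly in the child's requirements expression. A manual run of the same query might look like this (the job ID is hypothetical):

  # Prints, e.g., slot1@execute1.example.com
  condor_history 123.0 -af LastRemoteHost -limit 1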

On *nix systems, you should be able to copy and paste the example above, or download the attached example tar file, and run the example DAG without modification. On Windows, you will need to make some small changes for the example to work.
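To run the example, submit the DAG in the usual way and monitor the node jobs with condor_q:

  condor_submit_dag example.dag
  condor_q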
