How to Rendezvous in the HTCondor Parallel Universe

Introduction

When running parallel universe jobs, one node of a job often needs to send a small amount of data to the other nodes in the job. This might be an IP address and port, a filename, or some other value to synchronize on. HTCondor's condor_chirp command makes this straightforward: the sending process writes the information into the job ad with condor_chirp set_job_attr, and the receivers poll for it in the job ad with condor_chirp get_job_attr.
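The write-then-poll idea can be sketched as two small shell helpers. This is a minimal sketch rather than anything shipped with HTCondor: the function names publish_attr and fetch_attr are hypothetical, and it assumes that condor_chirp is on the PATH (or that CONDOR_CHIRP points to it) and that string values come back as quoted ClassAd literals.

```shell
# Path to condor_chirp; set CONDOR_CHIRP beforehand if it is not on the PATH.
CONDOR_CHIRP=${CONDOR_CHIRP:-condor_chirp}

# publish_attr NAME VALUE   (hypothetical helper)
# Sender side: store VALUE in the job ad under NAME.
# The value is wrapped in double quotes so it is a ClassAd string literal.
publish_attr() {
    "$CONDOR_CHIRP" set_job_attr "$1" "\"$2\""
}

# fetch_attr NAME   (hypothetical helper)
# Receiver side: print the attribute's value with any surrounding
# double quotes removed. Returns non-zero if the lookup fails.
fetch_attr() {
    fa_value=$("$CONDOR_CHIRP" get_job_attr "$1") || return 1
    fa_value=${fa_value#\"}
    fa_value=${fa_value%\"}
    printf '%s\n' "$fa_value"
}
```

Inside a running job, the sending node might call publish_attr JobServerAddress "$(hostname):5000" while the other nodes call fetch_attr JobServerAddress until it succeeds; the port number here is only an illustration.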

Example submit file

Assuming that the dedicated scheduler has been configured, create a submit file like this:

universe = parallel

executable = run.sh

should_transfer_files = yes
when_to_transfer_output = on_exit
getenv = true

output = out.$(Node)
error  = err.$(Node)
log    = log

machine_count = 2
+WantIOProxy = true
queue

The run.sh script

The run.sh script first checks whether it is the client or the server. Arbitrarily, we pick node 0 as the server and node 1 as the client. Node 0 picks a string, inserts it into the job ad attribute called JobServerAddress, and sleeps. Node 1 polls for this value, prints it once it is available, then exits. This value could just as well be the address and port the server is listening on, or anything else the two processes need to synchronize on.

#!/bin/bash


CONDOR_CHIRP=`condor_config_val LIBEXEC`/condor_chirp

# Tell chirp where the chirp config file is
export _CONDOR_CHIRP_CONFIG=$_CONDOR_SCRATCH_DIR/.chirp_config

if [[ $_CONDOR_PROCNO = "0" ]]
then
    # I'm the server
    $CONDOR_CHIRP set_job_attr JobServerAddress '"This is my address"'

    # Do server things here...
    sleep 3600
    exit 0
fi


if [[ $_CONDOR_PROCNO = "1" ]]
then
    # I'm the client
    until $CONDOR_CHIRP get_job_attr JobServerAddress > /dev/null 2>&1
    do
        sleep 2
    done

    JobServerAddress=`$CONDOR_CHIRP get_job_attr JobServerAddress`

    # I've got the address; do client things here, like connecting to the server
    echo "JobServerAddress is $JobServerAddress"
    exit 0
fi

exit 0
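
The client loop above polls forever if node 0 never publishes the attribute. A bounded wait is one possible variation. The sketch below assumes, as the script above does, that condor_chirp get_job_attr exits non-zero until the attribute exists. The function name wait_for_attr and its defaults of 30 tries, 2 seconds apart, are illustrative choices and not HTCondor settings.

```shell
# Path to condor_chirp; set CONDOR_CHIRP beforehand if it is not on the PATH.
CONDOR_CHIRP=${CONDOR_CHIRP:-condor_chirp}

# wait_for_attr NAME [MAX_TRIES] [INTERVAL_SECONDS]   (hypothetical helper)
# Poll the job ad for NAME. Print its value and return 0 as soon as the
# lookup succeeds; return 1 if it still fails after MAX_TRIES attempts.
wait_for_attr() {
    wfa_attr=$1
    wfa_tries=${2:-30}
    wfa_interval=${3:-2}

    while [ "$wfa_tries" -gt 0 ]; do
        if wfa_value=$("$CONDOR_CHIRP" get_job_attr "$wfa_attr" 2>/dev/null); then
            printf '%s\n' "$wfa_value"
            return 0
        fi
        wfa_tries=$((wfa_tries - 1))
        # Only sleep if another attempt is still to come.
        if [ "$wfa_tries" -gt 0 ]; then
            sleep "$wfa_interval"
        fi
    done
    return 1
}
```

The client branch could then use something like JobServerAddress=$(wait_for_attr JobServerAddress 60 2) || exit 1, so that a job whose server node never starts fails after roughly two minutes instead of hanging.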