How to Rendezvous in the HTCondor Parallel Universe

Introduction

When running parallel universe jobs, one node of a job often needs to send a small amount of data to the other nodes in the job. This can be an IP address and port, a filename, or some other value to synchronize on. HTCondor's condor_chirp command makes this straightforward. The basic idea is that the sending process writes the information into the job ad with condor_chirp set_job_attr, and the receivers poll for this information in the job ClassAd with condor_chirp get_job_attr.

Example submit file

Assuming that the dedicated scheduler has been configured, create a submit file like this:

universe = parallel

executable = run.sh

should_transfer_files = yes
when_to_transfer_output = on_exit
getenv = true

output = out.$(Node)
error  = err.$(Node)
log    = log

machine_count = 2
+WantIOProxy = true
queue

The run.sh script

The run.sh script first checks whether it is the server or the client. Arbitrarily, we pick node 0 for the server and node 1 for the client. Node 0 picks a string, inserts it into the job ad attribute called JobServerAddress, and sleeps. Node 1 polls for this value, prints it once it is available, then exits. This value could just as well be the address and port the server is listening on, or anything else the two processes would like to synchronize on.

#!/bin/bash


CONDOR_CHIRP=`condor_config_val LIBEXEC`/condor_chirp

# Tell chirp where the chirp config file is
export _CONDOR_CHIRP_CONFIG=$_CONDOR_SCRATCH_DIR/.chirp_config

if [[ $_CONDOR_PROCNO = "0" ]]
then
    # I'm the server
    $CONDOR_CHIRP set_job_attr JobServerAddress '"This is my address"'

    # Do server things here...
    sleep 3600
    exit 0
fi


if [[ $_CONDOR_PROCNO = "1" ]]
then
    # I'm the client
    until $CONDOR_CHIRP get_job_attr JobServerAddress > /dev/null 2>&1
    do
        sleep 2
    done

    JobServerAddress=`$CONDOR_CHIRP get_job_attr JobServerAddress`

    # I've got the address, do client things here, like connecting to the server...
    echo "JobServerAddress is $JobServerAddress"
    exit 0
fi

exit 0
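
The client loop above polls forever, so if the server node never sets the attribute, the client never exits. A minimal sketch of a bounded poll follows. The poll_until function and its arguments are hypothetical names introduced here for illustration, not part of HTCondor, and the sketch relies on the same assumption as the script above: that condor_chirp get_job_attr exits non-zero while the attribute is missing.

```shell
#!/bin/sh
# poll_until: hypothetical helper, not part of HTCondor.
# Usage: poll_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds.
# Returns 0 on success, 1 after MAX_TRIES failed attempts.
poll_until() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    until "$@" > /dev/null 2>&1
    do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]
        then
            return 1
        fi
        sleep "$delay"
    done
    return 0
}

# Possible use in run.sh (assumes CONDOR_CHIRP is set as in the script above):
#   poll_until 150 2 "$CONDOR_CHIRP" get_job_attr JobServerAddress || exit 1
```

With 150 tries and a 2 second delay, the client would give up after roughly five minutes. Both numbers are arbitrary and should be tuned to how long the server needs to start.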

Copying a file from one node to another

Here is a more sophisticated example, using the nc program (netcat) to copy a file from one machine to another. Note that there is no security at all in this file copy: any process on the network can connect and send data to the listening process, so you should only run this on a trusted, protected network.

The submit file is the same as above, but the run.sh is a bit more involved:

#!/bin/bash

CONDOR_CHIRP=`condor_config_val LIBEXEC`/condor_chirp

# Tell chirp where the chirp config file is
export _CONDOR_CHIRP_CONFIG=$_CONDOR_SCRATCH_DIR/.chirp_config

if [[ $_CONDOR_PROCNO = "0" ]]
then

    # I'm the server

    # Start netcat listening on an ephemeral port (0 means the kernel picks the port)
    #    It will wait for a connection, then write that data to output_file

    nc -l 0 > output_file &

    # pid of the nc running in the background
    NCPID=$!

    # Sleep a bit to ensure nc is running
    sleep 2

    # parse the actual port selected from netstat output
    NCPORT=`
        netstat -t -a -p 2>/dev/null |
        grep " $NCPID/nc" |
        awk -F: '{print $2}' | awk '{print $1}'
    `

    # grab the hostname
    HOSTNAME=`hostname`

    $CONDOR_CHIRP set_job_attr JobServerAddress \"${HOSTNAME}\ ${NCPORT}\"

    # Do other server things here...
    sleep 3600
    exit 0
fi


if [[ $_CONDOR_PROCNO = "1" ]]
then

    # I'm the client

    sleep 20
    # Poll until the job attribute appears in the ad

    until $CONDOR_CHIRP get_job_attr JobServerAddress > /dev/null 2>&1
    do
        sleep 2
    done

    JobServerAddress=`$CONDOR_CHIRP get_job_attr JobServerAddress`

    host=`echo $JobServerAddress | tr -d '"' | awk '{print $1}'`
    port=`echo $JobServerAddress | tr -d '"' | awk '{print $2}'`

    # Send the /etc/hosts file to the other side
    nc $host $port < /etc/hosts

    # Don't do any other client stuff, just exit
    exit 0
fi

exit 0
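
The client above passes whatever it parsed out of JobServerAddress straight to nc. If the server's netstat parsing fails to find a port, the attribute holds only a hostname and the nc call fails in a confusing way. As a sketch of one way to guard against that, the hypothetical parse_address function below (not part of HTCondor) strips the ClassAd string quotes, splits the value into host and port, and rejects anything that is not exactly a hostname followed by a numeric port.

```shell
#!/bin/sh
# parse_address: hypothetical helper, not part of HTCondor.
# Takes a value such as '"node1.example.com 40123"' and sets the
# shell variables host and port. Returns 1 if the value is malformed.
parse_address() {
    # Remove the double quotes that surround a ClassAd string value
    addr=$(printf '%s' "$1" | tr -d '"')

    # Split on whitespace; anything after the second field lands in extra
    read -r host port extra <<EOF
$addr
EOF

    # Require exactly two fields
    [ -n "$host" ] && [ -z "$extra" ] || return 1

    # Require a non-empty, all-digit port
    case $port in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Possible use in the client branch of run.sh:
#   parse_address "$JobServerAddress" || exit 1
#   nc "$host" "$port" < /etc/hosts
```

The hostname node1.example.com and port 40123 in the comment are made-up values for illustration only.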