HTCondorWiki: How To Use Condor Annex With On Demand Instances Eight Seven Three

[If you're using v8.7.2 or earlier, see the v8.7.2 instructions. These instructions are for v8.7.3. If you're using v8.7.4, see the v8.7.4 instructions.]

Using HTCondor Annex for the First Time

We assume you already have an AWS account, as well as a log-in account on a Linux machine with a public IP address and an administrator who's willing to open a port for you. If so, you can follow the instructions for using HTCondor Annex for the first time.

If you're not sure if you've configured condor_annex before, you may enter the following to check.

$ . ~/condor-8.7.3/condor.sh
$ condor_annex -check-setup
Checking for configuration bucket... OK.
Checking for Lambda functions... OK.
Checking for instance profile... OK.
Checking for security group... OK.
Your setup looks OK.

If you see anything else, follow the instructions above.
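If you want to run this check from a script, the following is a minimal sketch. The function name setup_looks_ok is hypothetical, and it assumes that a healthy setup always prints the final line exactly as shown above, and that condor_annex writes it to standard output.

```shell
# Succeed only if the text on standard input contains the success line
# shown in the example output above.
# Assumption: a healthy setup always prints exactly this sentence.
setup_looks_ok() {
    grep -q '^Your setup looks OK\.$'
}
```

You could then write condor_annex -check-setup | setup_looks_ok && echo "ready" and fall back to the first-time instructions when the function fails.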

What You'll Need to Know

To create an HTCondor annex with on-demand instances, you'll need to know two things:

A name for it. "MyFirstAnnex" is a fine name for your first annex.
How many instances you want. For your first annex, when you're checking to make sure things work, you may only want one instance.

Start an Annex

Entering the following will start an annex named "MyFirstAnnex" with one instance. condor_annex will print out what it's going to do, and then ask you if that's OK. You must type 'yes' (and hit enter) at the prompt to start an annex; if you do not, condor_annex will print out instructions about how to change whatever you may not like about what it said it was going to do, and then exit.

$ condor_annex -count 1 -annex-name MyFirstAnnex
Will request 1 m4.large on-demand instance for 0.83 hours.  Each instance will terminate after being idle for 0.25 hours.
Is that OK?  (Type 'yes' or 'no'): yes
Starting annex...
Annex started.  Its identity with the cloud provider is
'TestAnnex0_f2923fd1-3cad-47f3-8e19-fff9988ddacf'.  It will take about three minutes for the new machines to join the pool.

You won't need to know the annex's identity with the cloud provider unless something goes wrong.

Before starting the annex, condor_annex will check to make sure that the instances will be able to contact your pool. Contact your machine's administrator if condor_annex reports a problem with this step.

instance types

Each instance type provides a different number (and/or type) of CPU cores, amount of RAM, local storage, and the like. If you're not sure which instance type to use, we recommend starting with 'm4.large', which has 2 CPU cores and 8 GiB of RAM. As noted in the example output above, you can specify an instance type with the -aws-on-demand-instance-type flag.

leases

By default, condor_annex arranges for your annex's instances to be terminated after 0.83 hours (50 minutes) have passed. Once it's in place, this lease doesn't depend on your machine, but it's only checked every five minutes, so give your deadlines a lot of cushion to make sure you don't get charged for an extra hour. The lease is intended to help you conserve money by preventing the annex instances from accidentally running forever. As noted in the example output above, you can specify a lease duration (in decimal hours) with the -duration flag.

If you need to adjust the lease for a particular annex, you may do so by specifying an annex name and a duration, but not a count. When you do so, the new duration is set starting at the current time. For example, if you'd like "MyFirstAnnex" to expire eight hours from now:

$ condor_annex -annex-name MyFirstAnnex -duration 8
Lease updated.

idle time

By default, condor_annex will configure your annex's instances to terminate themselves after being idle for 0.25 hours (fifteen minutes). This is intended to help you conserve money in case of problems or an extended shortage of work. As noted in the example output above, you can specify a max idle time (in decimal hours) with the -idle flag. condor_annex considers an instance idle if it's unclaimed, so it won't get tricked by jobs with long quiescent periods.
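Both -duration and -idle take decimal hours, which can be awkward if you think in minutes. The following is a small sketch of a conversion helper; the function name minutes_to_hours is hypothetical, and rounding to two decimal places is an assumption chosen to match the values condor_annex prints (0.83 and 0.25).

```shell
# Convert whole minutes to decimal hours, rounded to two decimal places.
# LC_ALL=C keeps the decimal separator a '.' regardless of locale.
minutes_to_hours() {
    LC_ALL=C awk -v m="$1" 'BEGIN { printf "%.2f\n", m / 60 }'
}
```

For example, minutes_to_hours 50 prints 0.83 and minutes_to_hours 15 prints 0.25, so a 90-minute lease could be requested with -duration "$(minutes_to_hours 90)".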

multiple annexes

You may have up to fifty (or fewer, depending on what else you're doing with your AWS account) differently-named annexes running at the same time. Running condor_annex again with the same annex name before stopping that annex will both add instances to it and change its duration. Only instances which start up after an invocation of condor_annex will respect that invocation's max idle time. That may include instances still starting up from your previous (first) invocation of condor_annex, so if you're changing the max idle time, be sure your instances have all joined the pool before running condor_annex again with the same annex name. Each invocation of condor_annex requests a fixed number of instances of a given type; you may specify the count, the type, or both with each invocation, but doing so will not change the count or type of any previous request.

Monitor your Annex

You can find out if that instance has successfully joined the pool in the following way.

$ condor_status -annex MyFirstAnnex
slot1@ip-172-31-48-84.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767
slot2@ip-172-31-48-84.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767

              Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
 X86_64/LINUX     2     0       0         2       0          0        0      0
        Total     2     0       0         2       0          0        0      0

This example shows that the annex instance you requested has joined your pool. (The default annex image configures one static slot for each CPU it finds on start-up.)
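If you'd rather have a script tell you how many slots have joined, one possible approach is to count the slot lines in the condor_status output. This is only a sketch: the function name count_slots is hypothetical, and it assumes the output format shown above, where each slot line starts with "slotN@" and the summary lines are indented.

```shell
# Count the slot lines in condor_status output read from standard input.
# grep -c exits non-zero when the count is 0, so '|| true' keeps the
# function's exit status at success while still printing "0".
count_slots() {
    grep -c '^slot[0-9]*@' || true
}
```

For the example above, condor_status -annex MyFirstAnnex | count_slots would print 2, one for each CPU of the single m4.large instance.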

You can also get a report about the instances which have not joined your pool:

$ condor_annex -annex MyFirstAnnex -status
STATE          COUNT
pending            1
TOTAL              1

Instances not in the pool, grouped by state:
pending i-06928b26786dc7e6e

Multiple Annexes

The following command reports on all annex instances which have joined the pool, regardless of which annex they're from:

$ condor_status -annex
slot1@ip-172-31-48-84.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767
slot2@ip-172-31-48-84.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767
slot1@ip-111-48-85-13.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767
slot2@ip-111-48-85-13.ec2.internal  LINUX     X86_64 Unclaimed Idle 0.640 3767

              Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
 X86_64/LINUX     4     0       0         4       0          0        0      0
        Total     4     0       0         4       0          0        0      0

The following command reports on instances which have not joined the pool, regardless of which annex they're from:

$ condor_annex -status
NAME                        TOTAL running
NamelessTestA                   2       2
NamelessTestB                   3       3
NamelessTestC                   1       1

NAME                        STATUS  INSTANCES...
NamelessTestA               running i-075af9ccb40efb162 i-0bc5e90066ed62dd8
NamelessTestB               running i-02e69e85197f249c2 i-0385f59f482ae6a2e i-06191feb755963edd
NamelessTestC               running i-09da89d40cde1f212

The ellipsis in the last column (INSTANCES...) is to indicate that it's a very wide column and may wrap (as it will for NamelessTestB on an 80-column terminal), not that it has been truncated.
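Because the wide column can wrap, it may be easier to pull the instance IDs out of the report than to read them by eye. The following sketch assumes that instance IDs always look like "i-" followed by lowercase hexadecimal digits, as in the example output; the function name instance_ids is hypothetical.

```shell
# Print every instance ID found in 'condor_annex -status' output read from
# standard input, one per line.
# Assumption: an instance ID is a whole field of the form i-<hex digits>.
instance_ids() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^i-[0-9a-f]+$/) print $i }'
}
```

Piping condor_annex -status into instance_ids would then list one ID per line, whether or not the terminal wrapped the original output.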

Run a Job

Starting in v8.7.1, the default behaviour for an annex instance is to run only jobs submitted by the user who ran the condor_annex command. If you'd like to allow other users to run jobs, list them (separated by commas; be sure to include yourself) as arguments to the -owner flag when you start the instance. If you're creating an annex for general use, use the -no-owner flag to run jobs from anyone.
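Since it's easy to forget to include yourself in the -owner list, a small helper can build the comma-separated argument for you. This is a sketch only; the function name owner_list is hypothetical, and it assumes that your HTCondor user name is the same as the login name reported by id -un.

```shell
# Build a comma-separated owner list that always starts with the current
# user, followed by any other user names given as arguments.
owner_list() {
    me=$(id -un)
    list=$me
    for u in "$@"; do
        # Skip the current user if they were also named explicitly.
        [ "$u" = "$me" ] || list="$list,$u"
    done
    printf '%s\n' "$list"
}
```

You could then start an annex with condor_annex -count 1 -annex-name MyFirstAnnex -owner "$(owner_list otheruser1 otheruser2)", where the two user names are placeholders.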

Also starting in v8.7.1, the default behaviour for an annex instance is to run only jobs which have the MayUseAWS attribute set (to true). To submit a job with MayUseAWS set to true, add +MayUseAWS = TRUE to the submit file somewhere before the queue command. To allow an existing job to run in the annex, use condor_qedit. For instance, if you'd like cluster 1234 to run on AWS:

$ condor_qedit 1234 "MayUseAWS = TRUE"
Set attribute "MayUseAWS" for 21 matching jobs.
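For a new job, the attribute goes in the submit file. The following sketch writes a minimal submit file to a path you choose; the function name write_submit_file, the job itself (sleeping for 60 seconds), and the log file name are all illustrative assumptions. The only line that matters here is +MayUseAWS = TRUE, placed before the queue command.

```shell
# Write a minimal submit file to the path given as the first argument.
write_submit_file() {
    cat > "$1" <<'EOF'
# Hypothetical job: run /bin/sleep for 60 seconds.
executable = /bin/sleep
arguments  = 60
log        = sleep.log

# Allow this job to run on annex instances; must precede 'queue'.
+MayUseAWS = TRUE

queue
EOF
}
```

After write_submit_file sleep.submit, you would submit the job as usual with condor_submit sleep.submit.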

Stop an Annex

The following command shuts HTCondor off on each instance in the annex; if you're using the default annex image, doing so causes each instance to shut itself down.

$ condor_off -annex MyFirstAnnex
Sent "Kill-Daemon" command for "master" to master ip-172-31-48-84.ec2.internal

Multiple Annexes

The following command turns off all annex instances in your pool, regardless of which annex they're from:

$ condor_off -annex
Sent "Kill-Daemon" command for "master" to master ip-172-31-48-84.ec2.internal
Sent "Kill-Daemon" command for "master" to master ip-111-48-85-13.ec2.internal

Advanced Usage

The information in this section is for advanced users and may not apply (or make sense) to everyone. Additional information about advanced usage is available in the manual.

Configure the Annex

You can customize the configuration of your annex. If you pass the full path to a directory (for example, /home/annex/config.d) to condor_annex using the -config-dir option, condor_annex will copy the files in that directory to the HTCondor config directory on each annex instance. This does not replace the customization that condor_annex is already doing to configure security and tell the annex instances which pool to join; those changes will be laid down on top of (a temporary copy of) the directory you specified before being transferred to the instances.
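As an illustration, the following sketch prepares such a directory containing one small configuration file. The function name make_annex_config_dir, the file name 50-annex-note.config, and the AnnexPurpose attribute are all hypothetical; the file simply advertises a custom attribute in each annex machine's ClassAd using the STARTD_ATTRS configuration macro.

```shell
# Create a configuration directory suitable for passing to -config-dir.
# The argument must be a full (absolute) path, as condor_annex expects.
make_annex_config_dir() {
    case "$1" in
        /*) ;;
        *) echo "need a full path, got: $1" >&2; return 1 ;;
    esac
    mkdir -p "$1" || return 1
    cat > "$1/50-annex-note.config" <<'EOF'
# Advertise a custom (hypothetical) attribute in the machine ClassAd.
AnnexPurpose = "testing"
STARTD_ATTRS = $(STARTD_ATTRS) AnnexPurpose
EOF
}
```

After make_annex_config_dir /home/annex/config.d, you would add -config-dir /home/annex/config.d to the condor_annex command line used to start the annex.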

Spot Fleet

condor_annex can make use of AWS's Spot Fleet to help reduce the cost of your instances. Consult the manual for more information. Note that v8.7.3's implementation of condor_annex -status is buggy and will not properly return information about Spot Fleet-based annexes.