{section: Assumptions}
We assume you already have an AWS account.
{section: Installation and Configuration}
Martin Kandes wrote excellent instructions on installing and configuring =condor_annex=: http://www.t2.ucsd.edu/twiki2/bin/view/UCSDTier2/Condor_annex. For first-time users, we recommend that you use the 'us-east-1' region (and skip the rest of step 4) and that you pick one of the AMIs listed under step 7 (instead of building your own).
Instead of step 10, please obtain the most recent version of =condor_annex= from FIXME.
{section: Start an Annex}
The minimal set of options to start an annex follows:
{term}
$ condor_annex --project-id MyFirstAnnex --instances 3 --expiry "2017-01-20 17:18:19" --instance-types FIXME --image-ids FIXME --spot-prices FIXME
{endterm}
_[The collector address and the password file path are both extracted from the command environment's HTCondor configuration. The password file is uploaded to a private S3 bucket managed by condor_annex; the location is passed into the custom AMI via the usual instance contextualization methods.]_
This command will return after HTCondor has set up the lease and requested that Amazon start 3 instances of the type(s) you specified.
_[The tool allows you to create a set of default launch configurations, so you don't have to type them in every time, but specifying an example type on the command-line is a much faster way to get started.]_
It will take a few minutes for the annex's slots to show up. Once they do, they will be assigned in the next negotiation cycle (which may also take a few minutes), and your jobs will start running shortly after that.
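The =--expiry= value in the example above is an absolute timestamp, which is tedious to type by hand. The following is a minimal sketch of one way to generate it, not part of =condor_annex= itself. It assumes GNU =date= (the =-d= option is not available in BSD or macOS =date=) and assumes the "YYYY-MM-DD HH:MM:SS" format shown in the example; the helper name =expiry_in_hours= is hypothetical. Whether the tool interprets the time as local or UTC is not stated here, so check before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print a timestamp N hours from now in the
# "YYYY-MM-DD HH:MM:SS" form used by the --expiry example above.
# Assumes GNU date; -d is not portable to BSD/macOS date.
expiry_in_hours() {
    date -d "+$1 hours" "+%Y-%m-%d %H:%M:%S"
}

# Example: an expiry eight hours from now (local time).
EXPIRY=$(expiry_in_hours 8)
echo "$EXPIRY"
```

The result can then be passed as =--expiry "$EXPIRY"= alongside the other options shown above. The double quotes matter because the timestamp contains a space.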
{section: Monitor your Annex}
The following command line will print out the usual =condor_status= information for the annex you specify:
{term}
$ condor_status -annex MyFirstAnnex
ip-172-31-48-84.ec2.internal LINUX X86_64 Claimed Busy 0.640 3767
ip-172-31-54-121.ec2.internal LINUX X86_64 Claimed Busy 0.880 3767
ip-172-31-56-45.ec2.internal LINUX X86_64 Claimed Busy 0.600 3767
Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX 11 0 11 0 0 0 0 0
Total 11 0 11 0 0 0 0 0
{endterm}
[This is entirely equivalent to condor_status -const 'AnnexName =?= "MyFirstAnnex"' so it should be easy to implement.]
In this case, all three of the instances you requested have already started to run jobs.
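Until the =-annex= option is available, the constraint form mentioned in the note above can be used directly. The following is a minimal sketch, assuming annex slots advertise an =AnnexName= attribute as that note describes; the helper name =annex_constraint= is hypothetical. The annex name is a ClassAd string literal, so it must be wrapped in double quotes inside the expression.

```shell
#!/bin/sh
# Hypothetical helper: build the ClassAd constraint that selects the
# slots of one annex. AnnexName is the attribute named in the note
# above; the annex name is a string literal, so it is double-quoted.
annex_constraint() {
    printf 'AnnexName =?= "%s"' "$1"
}

# Print the constraint for inspection.
annex_constraint MyFirstAnnex
echo
```

The output can be passed to any HTCondor tool that accepts =-constraint=, for example =condor_status -constraint "$(annex_constraint MyFirstAnnex)"=.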
{section: Stop an Annex}
If you're already familiar with the =condor_off= command, you can use it to turn off HTCondor on the annex nodes; the default image is configured so that this will also shut down the machine. To shut down every machine in an annex, use the following command line:
{term}
$ condor_off -annex MyFirstAnnex
{endterm}
[This is entirely equivalent to condor_off -const 'AnnexName =?= "MyFirstAnnex"' so it should be easy to implement.]
{section: Why isn't my Annex running jobs?}
If =condor_status -annex= shows an idle machine, you can use =condor_q= in the usual way to help you determine why:
{term}
$ condor_q -rev -machine ip-172-31-48-84.ec2.internal
-- Schedd: submit-3.batlab.org : <128.104.100.22:50001?...
-- Slot: ip-172-31-48-84.ec2.internal : Analyzing matches for 7 Jobs in 1 autoclusters
The Requirements expression for this slot is
( START ) &&
( IsValidCheckpointPlatform )
START is
true
IsValidCheckpointPlatform is
( TARGET.JobUniverse isnt 1 ||
( ( MY.CheckpointPlatform isnt undefined ) &&
( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
( TARGET.NumCkpts == 0 ) ) ) )
This slot defines the following attributes:
CheckpointPlatform = "LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
Job 737.0 has the following attributes:
TARGET.JobUniverse = 5
TARGET.NumCkpts = 0
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[1] 1 IsValidCheckpointPlatform
slot8@submit-3.batlab.org: Run analysis summary of 7 jobs.
7 (100.00 %) match both slot and job requirements.
7 match the requirements of this slot.
7 have job requirements that match this slot.
{endterm}
Since the instances all start from the same image, it's unlikely that one instance in an annex will run a job when another won't. However, since the annex may obtain slots from more than one instance type, it's possible that a mismatch between your job's requirement(s) and the default slot size will result in this situation. In this example, the analysis indicates that 100% of your jobs would run in the slot. That could mean that you need to wait a few minutes for another negotiation cycle to occur, or it could mean that you have more slots than jobs.
By default, you'll only have fifteen minutes to analyze an idle instance; once the instance has been idle for that long, it will shut itself down to avoid uselessly spending (more of) your money.