Assumptions

We assume you already have an AWS account.

Installation and Configuration

This will involve a lot of copying and pasting. :(

Install a personal Condor

See the instructions for CreatingPersonalHtcondor, but use the following tarball instead: FIXME.

Configure it to use a pool password.

The /scratch/condor84 below refers to the "RELEASE_DIR" from when you installed the personal Condor; the /scratch/local/condor84 refers to the "LOCAL_DIR".

Add the following lines to the HTCondor configuration to enable the pool password method:

/scratch/local/condor84/condor_config.local
SEC_PASSWORD_FILE = /scratch/condor84/etc/condor_pool_password
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
ALLOW_DAEMON = condor_pool@*

You must also run the following command, which prompts you to enter a password:

$ condor_store_cred -c add

(For more details, see HowToEnablePoolPassword.)
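As a quick sanity check after storing the password, you can confirm that the file named by SEC_PASSWORD_FILE exists and isn't empty. The sketch below is only an illustration: the function name check_pool_password_file is made up for this example, and it assumes condor_config_val is on your PATH when you run the usage line.

```shell
# Hypothetical helper: succeed only if the given pool password file
# exists and is non-empty.
check_pool_password_file() (
  password_file="$1"
  if [ ! -s "$password_file" ]; then
    echo "missing or empty: $password_file" >&2
    exit 1
  fi
  echo "ok: $password_file"
)

# Usage (not run here):
#   check_pool_password_file "$(condor_config_val SEC_PASSWORD_FILE)"
```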

Prepare your EC2 account

You will need to provide HTCondor with an access key/secret key pair. For security reasons, you specify the location of a file containing the secret key instead of specifying the secret key directly; the same goes for the access key. If you don't already have these keys, you can create a new pair from the AWS web console; the process varies depending on which kind of account you have. (FIXME: (link to) Instructions for the root account.)
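One way to create the two key files so that only you can read them is sketched below. The function name store_aws_keys and the file names accessKeyFile and secretKeyFile are placeholders chosen for this example, not names HTCondor requires; the sketch also assumes each file should contain nothing but the key itself.

```shell
# Hypothetical helper: write each key to its own file, readable only by
# the owner. Arguments: directory, access key ID, secret access key.
store_aws_keys() (
  key_dir="$1"
  access_key_id="$2"
  secret_access_key="$3"

  # Anything created from here on is owner-only.
  umask 077
  mkdir -p "$key_dir" || exit 1

  printf '%s\n' "$access_key_id" > "$key_dir/accessKeyFile" || exit 1
  printf '%s\n' "$secret_access_key" > "$key_dir/secretKeyFile" || exit 1

  # Tighten permissions in case the files already existed.
  chmod 600 "$key_dir/accessKeyFile" "$key_dir/secretKeyFile"
)

# Usage (not run here; note that the keys will land in your shell history):
#   store_aws_keys "$HOME/.condor" <access-key-id> <secret-access-key>
```

The resulting paths are what you would later give to HTCondor as the access key file and secret key file.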

Create a private S3 bucket

You'll need access to a private S3 bucket. (FIXME: (link to) Instructions for creating a private bucket.) In the following instructions, we'll call this bucket 'privateBucketName'; replace that string, when you see it, with the actual name of the private bucket you created in this step.
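If you prefer the command line to the web console, a private bucket can probably be created along the following lines. This is a sketch, not the documented procedure: it assumes the AWS CLI is installed and configured with your keys, and the function name create_private_bucket is made up for this example. Real bucket names must be lowercase, so 'privateBucketName' itself won't be accepted as written.

```shell
# Hypothetical helper: create a bucket and block all public access to it.
# Arguments: bucket name, region (defaults to us-east-1).
create_private_bucket() (
  bucket="$1"
  region="${2:-us-east-1}"

  # Outside us-east-1, S3 requires an explicit location constraint.
  if [ "$region" = "us-east-1" ]; then
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      > /dev/null || exit 1
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region" \
      > /dev/null || exit 1
  fi

  # Refuse public ACLs and public bucket policies.
  aws s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true" \
    || exit 1
)

# Usage (not run here):
#   create_private_bucket my-private-bucket-name us-east-1
```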

Prepare the lease machinery

To avoid having to upload it every time, the annex assumes that the Lambda function it needs already exists and is configured to run as a role with the required permissions. We've provided a CloudFormation template that will create and configure the Lambda function for you [FIXME: where?]. Instructions follow for readers who haven't created a stack from a template file before. After logging into the AWS web console, do the following for each region you intend to use (you can do just 'us-east-1' to start, since that has the example AMI):

  1. Switch to the region. (The second drop-down box in from the upper right.)
  2. Switch to CloudFormation. (In the Services menu, under Management.)
  3. Click the "Create Stack" button.
  4. Upload the template using the "Browse..." button.
  5. Click the "Next" button.
  6. Name the stack; "HTCondorLeaseImplementation" is a good name.
  7. Click the "Next" button. (You don't need to change anything on the options screen.)
  8. Check the box next to "I acknowledge" (down near the bottom) and click the "Create" button (where the "Next" button was).
  9. AWS should return you to the list of stacks; select the one you just created and select the "Outputs" tab.
  10. Copy the long string labelled "LeaseFunctionARN"; you'll need it to configure condor_annex. It may take some time for that string to appear (you may need to reload the page as well). Wait for the stack to enter the 'CREATE_COMPLETE' state before using the LeaseFunctionARN (see below).
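The console steps above can also be approximated with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured, and that the template creates only unnamed IAM resources (templates with named IAM resources need CAPABILITY_NAMED_IAM instead). The function name create_lease_stack is made up for this example, and the template file name in the usage line is a placeholder.

```shell
# Hypothetical helper: create the stack, wait for it to finish, and print
# the LeaseFunctionARN output.
# Arguments: template file, region, stack name (optional).
create_lease_stack() (
  template_file="$1"
  region="$2"
  stack_name="${3:-HTCondorLeaseImplementation}"

  # CAPABILITY_IAM is the CLI counterpart of the "I acknowledge" checkbox.
  aws cloudformation create-stack \
    --region "$region" \
    --stack-name "$stack_name" \
    --template-body "file://$template_file" \
    --capabilities CAPABILITY_IAM > /dev/null || exit 1

  # Block until the stack reaches the CREATE_COMPLETE state.
  aws cloudformation wait stack-create-complete \
    --region "$region" --stack-name "$stack_name" || exit 1

  # Print only the value of the LeaseFunctionARN output.
  aws cloudformation describe-stacks \
    --region "$region" --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='LeaseFunctionARN'].OutputValue" \
    --output text
)

# Usage (not run here):
#   create_lease_stack <template-file> us-east-1
```

Because the function waits for the stack to complete before reading its outputs, the ARN it prints should be safe to use right away.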

Prepare the dynamic configuration machinery

For the same reason, you'll have to create a role for the annex instances, so they (but nobody else) can access the private S3 bucket. [FIXME: This should probably just be a CF parameter?] Use the generate-role script, distributed FIXME, to create a CloudFormation template:

$ generate-role privateBucketName config.tar.gz > role.json

Create a stack by uploading role.json, but otherwise following the instructions from the previous section; the stack's output will be named "InstanceConfigurationProfile", and you'll need its value later.

Create Spot Fleet role

If the account you're using has never created a Spot Fleet, create one now:

  1. Switch to the region. (The second drop-down box in from the upper right.)
  2. Switch to EC2. (In the Services menu, under Compute.)
  3. Click on "Spot Requests" in the list on the left (under INSTANCES).
  4. Click the "Request Spot Instances" button.
  5. Click the "Next" button.
  6. [FIXME: automagic creating the IAM Fleet Role].

Create Security Group

You'll also need a security group that allows HTCondor (and SSH, just in case) through the firewall:

  1. Click on "Security Groups" (under NETWORK & SECURITY).
  2. Click the "Create Security Group" button.
  3. Enter a name; "HTCondorAnnexSG" is a good one.
  4. Enter a description; "Allows SSH and HTCondor from anywhere" would be accurate.
  5. Make sure that the "Inbound" tab (the default) is selected.
  6. Click the "Add rule" button. Change the left-most drop-down box from "Custom TCP Rule" to "SSH"; change the right-most drop-down box from "Custom" to "Anywhere". (This is less secure than it could be, but these instructions don't have room for a discussion about that.)
  7. Click the "Add rule" button again. This time, change the second text box to read "9618" (its column is labelled "Port Range"); also change the right-most drop-down box from "Custom" to "Anywhere".
  8. Click the "Create" button.
  9. You'll now see a list of security groups. The second column is the group name; find the name you entered when you created the group ("HTCondorAnnexSG") and record its security group ID (the column to the left).
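For reference, roughly the same security group can be created with the AWS CLI. This is a sketch under a few assumptions: the AWS CLI is installed and configured, the group is created in your default VPC, and the function name create_annex_security_group is made up for this example. Like the console steps, it opens SSH and port 9618 to any address.

```shell
# Hypothetical helper: create the group, open ports 22 and 9618 to the
# world, and print the new security group ID.
# Arguments: region, group name (optional).
create_annex_security_group() (
  region="$1"
  group_name="${2:-HTCondorAnnexSG}"

  # Create the group and capture its ID.
  group_id="$(aws ec2 create-security-group \
    --region "$region" \
    --group-name "$group_name" \
    --description "Allows SSH and HTCondor from anywhere" \
    --query GroupId --output text)" || exit 1

  # Allow inbound SSH (22) and HTCondor (9618) from any address.
  for port in 22 9618; do
    aws ec2 authorize-security-group-ingress \
      --region "$region" \
      --group-id "$group_id" \
      --protocol tcp --port "$port" --cidr 0.0.0.0/0 \
      > /dev/null || exit 1
  done

  echo "$group_id"
)

# Usage (not run here):
#   create_annex_security_group us-east-1
```

The ID it prints is the same security group ID you would otherwise read from the list in the console.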

Configure condor_annex

These instructions use an image published by the HTCondor developers to help people get started. The image's OS is Amazon Linux (based on CentOS 6). The example config.json [FIXME: Where?] uses that image and generates slots with 1 CPU and 2 GB of RAM using whatever instance type(s) happen to be cheapest at the time. If you want to tweak those values, read the AWS docs: http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_SpotFleetRequestConfigData.html. The example config.json bids the on-demand prices for its instance types. See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html for more information about Spot prices and bidding.

[We think we can automate the process of configuring HTCondor and its image down to just installing a special "EC2" RPM, but the last attempt had a bug.]

Add the following lines to the HTCondor configuration:

/scratch/local/condor84/condor_config.local
ANNEX_DEFAULT_ACCESS_KEY_FILE = <path-to-access-key-file>
ANNEX_DEFAULT_SECRET_KEY_FILE = <path-to-secret-key-file>
ANNEX_DEFAULT_SPOT_FLEET_CONFIG_FILE = <path-to-default-config-file>

ANNEX_DEFAULT_LEASE_ARN = <LeaseFunctionARN>
ANNEX_DEFAULT_IAM_FLEET_ROLE_ARN = <IamFleetRoleARN>
ANNEX_DEFAULT_SECURITY_GROUP_ID = <SecurityGroupID>
ANNEX_DEFAULT_INSTANCE_ROLE_ARN = <InstanceRoleARN>
[This file is a lie. However, it shouldn't be hard to implement. It's just not wise, as written, in the long term.]

Start an Annex

The minimal set of options to start an annex follows:

$ condor_annex -annex-name MyFirstAnnex -slots 3 -duration 24h

[The collector address and pool password will be obtained from configuration; previous steps configured the default AMI and JSON; the latter will specify count in units of 1 core and 2 GB of RAM (for the instance types which exist at the time of release). We also need to write code to generate and upload the configuration corresponding to the collector and password file in question.]

This command will return after HTCondor has set up the 24-hour lease and requested that Amazon start 3 slots, each with one CPU and 2 GB of RAM.

It may take a few minutes for the annex's slots to show up. Once they do, they will be matched to jobs in the next negotiation cycle (which may also take a few minutes), and your jobs will start running shortly after that.

[We don't have to do anything special with respect to running jobs (and only the user's jobs) on the annex, since these instructions specify a personal HTCondor.]

Monitor your Annex

The following command line will print out the usual condor_status information for the annex you specify:

$ condor_status -annex MyFirstAnnex
ip-172-31-48-84.ec2.internal  LINUX      X86_64 Claimed   Busy      0.640 3767
ip-172-31-54-121.ec2.internal LINUX      X86_64 Claimed   Busy      0.880 3767
ip-172-31-56-45.ec2.internal  LINUX      X86_64 Claimed   Busy      0.600 3767

              Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain
 X86_64/LINUX     3     0       3         0       0          0        0      0
        Total     3     0       3         0       0          0        0      0

In this case, all three of the slots you requested have already started to run jobs.

[This is entirely equivalent to condor_status -constraint 'AnnexName =?= "MyFirstAnnex"' so it should be easy to implement.]

Stop an Annex

If you're already familiar with the condor_off command, you can use it to turn off HTCondor on the annex nodes; the default image is configured so that this will also shut down the machine. To shut down each machine in an annex, use the following command line:

$ condor_off -annex MyFirstAnnex

[This is entirely equivalent to condor_off -constraint 'AnnexName =?= "MyFirstAnnex"' so it should be easy to implement.]