{section: Assumptions}

We assume you already have an AWS account.

{section: Installation and Configuration}

This will involve a lot of copying and pasting. :(

{subsection: Install a personal Condor}

See the instructions for CreatingPersonalHtcondor, but use the following tarball instead: FIXME.

{subsection: Configure it to use a pool password.}

The =/scratch/condor84= below refers to the "RELEASE_DIR" from when you installed the personal Condor; the =/scratch/local/condor84= refers to the "LOCAL_DIR". Add the following lines to the HTCondor configuration to enable the pool password method:

{file: /scratch/local/condor84/condor_config.local}
SEC_PASSWORD_FILE = /scratch/condor84/etc/condor_pool_password

SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
ALLOW_DAEMON = condor_pool@*
{endfile}

You must also run the following command, which will prompt you to enter a password:

{term}
$ condor_store_cred -c add
{endterm}

(For more details, see {wiki: HowToEnablePoolPassword}.)

{subsection: Prepare your EC2 account}

You will need to provide HTCondor with an access key/secret key pair. For security reasons, you specify the location of a file containing the secret key instead of specifying the secret key directly; the same goes for the access key. If you don't already have these keys, you can create a new pair from the AWS web console; the process varies depending on which kind of account you have. (FIXME: (link to) Instructions for the root account.)

You'll need access to a private S3 bucket. (FIXME: (link to) Instructions for creating private bucket.)

To avoid having to upload it every time, the annex assumes that the Lambda function it needs already exists and is configured to run as a role with the required permissions.
We've provided a CloudFormation template that will create and configure the Lambda function for you [FIXME: where?]. Instructions follow for readers who haven't created a stack from a template file before. After logging into the AWS web console, do the following for each region you intend to use (start with us-east-1, since that has the example AMI):

1: Switch to the region. (The second drop-down box in from the upper right.)
1: Switch to CloudFormation. (In the _Services_ menu, under _Management_.)
1: Click the "Create Stack" button.
1: Upload the template using the "Browse..." button.
1: Click the "Next" button.
1: Name the stack; "HTCondorLeaseImplementation" is a good name.
1: Click the "Next" button. (You don't need to change anything on the options screen.)
1: Check the box next to "I acknowledge" (down near the bottom) and click the "Create" button (where the "Next" button was).
1: AWS should return the list of stacks; select the one you just created and select the "outputs" tab.
1: Copy the long string labelled "LeaseFunctionARN"; you'll need it to configure _condor_annex_.

It may take some time for that string to appear (you may need to reload the page as well). Wait for the stack to enter the 'CREATE_COMPLETE' state before using the LeaseFunctionARN (see below).

For the same reason, you'll have to create a role for the annex instances, so they (but nobody else) can access the private S3 bucket. [FIXME: This should probably just be a CF parameter?] Use the =generate-role= script to create a CloudFormation template:

{term}
$ generate-role privateBucketName config.targ.gz > role.json
{endterm}

Create a stack by uploading =role.json=; its output will be named "InstanceConfigurationProfile", and you'll need its value later.

If the account you're using has never created a Spot Fleet, create one now:

1: Switch to the region. (The second drop-down box in from the upper right.)
1: Switch to EC2. (In the _Services_ menu, under _Compute_.)
1: Click on "Spot Requests" in the list on the left (under _INSTANCES_).
1: Click the "Request Spot Instances" button.
1: Click the "Next" button.
1: [FIXME: automagic creating the IAM Fleet Role].

You'll also need a security group that allows HTCondor (and SSH, just in case) through the firewall:

1: Click on "Security Groups" (under _NETWORK & SECURITY_).
1: Click the "Create Security Group" button.
1: Enter a name; "HTCondorAnnexSG" is a good one.
1: Enter a description; "Allows SSH and HTCondor from anywhere" would be accurate.
1: Make sure that the "Inbound" tab (the default) is selected.
1: Click the "Add rule" button. Change the left-most drop-down box from "Custom TCP Rule" to "SSH"; change the right-most drop-down box from "Custom" to "Anywhere". (This is less secure than it could be, but these instructions don't have room for a discussion about that.)
1: Click the "Add rule" button again. This time, change the second text box to read "9618" (its column is labelled "Port Range"); also change the right-most drop-down box from "Custom" to "Anywhere".
1: Click the "Create" button.
1: You'll now see a list of security groups. The second column is the group name; find the name you entered when you created the group ("HTCondorAnnexSG") and record its security group ID (the column to the left).

{subsection: Configure condor_annex}

These instructions use an image published by the HTCondor developers to help people get started. The image's OS is Amazon Linux (based on CentOS 6). The example =config.json= [FIXME: Where?] uses that image and generates slots with 1 CPU and 2 GB of RAM using whatever instance type(s) happen to be cheapest at the time. If you want to tweak those values, read the AWS docs: http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_SpotFleetRequestConfigData.html.

The example =config.json= bids the on-demand prices for its instance types.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html for more information about Spot prices and bidding.

_[We think we can automate the process of configuring HTCondor and its image down to just installing a special "EC2" RPM, but the last attempt had a bug.]_

Add the following lines to the HTCondor configuration:

{file: /scratch/local/condor84/condor_config.local}
ANNEX_DEFAULT_ACCESS_KEY_FILE = <path-to-access-key-file>
ANNEX_DEFAULT_SECRET_KEY_FILE = <path-to-secret-key-file>
ANNEX_DEFAULT_CONFIG_FILE = <path-to-default-config-file>
ANNEX_DEFAULT_LEASE_ARN = <LeaseFunctionARN>
ANNEX_DEFAULT_IAM_FLEET_ROLE_ARN = <IamFleetRoleARN>
ANNEX_DEFAULT_SECURITY_GROUP_ID = <SecurityGroupID>
ANNEX_DEFAULT_INSTANCE_ROLE_ARN = <InstanceRoleARN>
{endfile}

_[This file is a lie. However, it shouldn't be hard to implement. It's just not wise, as written, in the long term.]_

{section: Start an Annex}

The minimal set of options to start an annex follows:

{term}
$ condor_annex -name MyFirstAnnex -slots 3 -duration 24h
{endterm}

_[The collector address and pool password will be obtained from configuration; previous steps configured the default AMI and JSON; the latter will specify_ count _in units of 1 core and 2 GB of RAM (for the instance types which exist at the time of release). We also need to write code to generate and upload the configuration corresponding to the collector and password file in question.]_

This command will return after HTCondor has set up the 24-hour lease and requested that Amazon start 3 slots, each with one CPU and 2 GB of RAM. It may take a few minutes for the annex's slots to show up, but they will be assigned in the next negotiation cycle (which may also take a few minutes), and your jobs will start running shortly after that.
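
If you start annexes from a script, the minimal command line above can be assembled programmatically. The following is only a sketch: it assumes the =-name=, =-slots=, and =-duration= options behave as described in this section, and the helper name =build_annex_command= and its validation rules (including the "24h"-style duration pattern) are hypothetical, not part of HTCondor.

```python
import re

# Hypothetical helper: not part of HTCondor. It only assembles the argument
# list for the minimal condor_annex invocation shown in this section.
# Assumption: durations are written as a positive number of hours, e.g. "24h".
_DURATION_RE = re.compile(r"^[1-9][0-9]*h$")


def build_annex_command(name, slots, duration):
    """Return the argv list for a minimal condor_annex invocation."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("annex name must be non-empty and contain no whitespace")
    if isinstance(slots, bool) or not isinstance(slots, int) or slots < 1:
        raise ValueError("slots must be a positive integer")
    if not _DURATION_RE.match(duration):
        raise ValueError("duration must look like '24h'")
    return ["condor_annex", "-name", name,
            "-slots", str(slots), "-duration", duration]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_annex_command("MyFirstAnnex", 3, "24h")))
```

Returning a list rather than a single string means the result can be handed to =subprocess.run= without shell quoting concerns.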

_[We don't have to do anything special with respect to running jobs (and only the user's jobs) on the annex, since these instructions specify a personal HTCondor.]_

{section: Monitor your Annex}

The following command line will print out the usual =condor_status= information for the annex you specify:

{term}
$ condor_status -annex MyFirstAnnex
ip-172-31-48-84.ec2.internal  LINUX      X86_64 Claimed   Busy     0.640 3767
ip-172-31-54-121.ec2.internal LINUX      X86_64 Claimed   Busy     0.880 3767
ip-172-31-56-45.ec2.internal  LINUX      X86_64 Claimed   Busy     0.600 3767

                     Total Owner Claimed Unclaimed Matched Preempting Backfill Drain

        X86_64/LINUX    11     0      11         0       0          0        0     0

               Total    11     0      11         0       0          0        0     0
{endterm}

<html><i>[This is entirely equivalent to </i><tt>condor_status -const 'AnnexName =?= MyFirstAnnex'</tt><i> so it should be easy to implement.]</i></html>

In this case, all three of the slots you requested have already started to run jobs.

{section: Stop an Annex}

If you're already familiar with the =condor_off= command, you can use it to turn off HTCondor on the annex nodes; the default image is configured so that this will also shut down the machine.
To shut down each machine in an annex, use the following command line:

{term}
$ condor_off -annex MyFirstAnnex
{endterm}

<html><i>[This is entirely equivalent to </i><tt>condor_off -const 'AnnexName =?= MyFirstAnnex'</tt><i> so it should be easy to implement.]</i></html>

{section: Why isn't my Annex running jobs?}

If =condor_status -annex= shows an idle machine, you can use =condor_q= in the usual way to help you determine why:

{term}
$ condor_q -rev -machine ip-172-31-48-84.ec2.internal

-- Schedd: submit-3.batlab.org : <
-- Slot: ip-172-31-48-84.ec2.internal : Analyzing matches for 7 Jobs in 1 autoclusters

The Requirements expression for this slot is

    ( START ) && ( IsValidCheckpointPlatform )

  START is
    true

  IsValidCheckpointPlatform is
    ( TARGET.JobUniverse isnt 1 || ( ( MY.CheckpointPlatform isnt undefined ) && ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 2.6.x normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"

Job 737.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[1]           1  IsValidCheckpointPlatform

slot8@submit-3.batlab.org: Run analysis summary of 7 jobs.
    7 (100.00 %) match both slot and job requirements.
    7 match the requirements of this slot.
    7 have job requirements that match this slot.
{endterm}

Since the instances all start from the same image, it's unlikely that one instance in an annex will run a job when another won't. However, since the annex may obtain slots from more than one instance type, it's possible that a mismatch between your job's requirement(s) and the default slot size will result in this situation.

In this example, the analysis indicates that 100% of your jobs would run in the slot.
That could mean that you need to wait a few minutes for another negotiation cycle to occur, or it could mean that you have more slots than jobs.

By default, you'll only have fifteen minutes to analyze an idle instance; once the instance has been idle for that long, it will shut itself down to avoid uselessly spending (more of) your money.
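
Because the window for analysis is short, it may help to pick out idle slots quickly. The sketch below filters text laid out like the =condor_status -annex= sample shown earlier. It assumes each slot line has exactly seven whitespace-separated columns (name, OpSys, Arch, State, Activity, LoadAv, Mem), which may not hold for other HTCondor versions or output formats. The function name =idle_slots= is hypothetical.

```python
def idle_slots(status_text):
    """Return the names of slots whose Activity column reads 'Idle'.

    Assumption: slot lines have exactly seven whitespace-separated fields
    (name, OpSys, Arch, State, Activity, LoadAv, Mem), as in the sample
    output on this page. Header and totals lines do not fit that shape
    and are skipped.
    """
    names = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        name, _opsys, _arch, _state, activity, load, mem = fields
        try:
            float(load)  # LoadAv should parse as a number
            int(mem)     # Mem should parse as an integer
        except ValueError:
            continue     # not a slot line
        if activity == "Idle":
            names.append(name)
    return names


if __name__ == "__main__":
    import sys
    # Reads condor_status-style text from standard input.
    for slot_name in idle_slots(sys.stdin.read()):
        print(slot_name)
```

Each name it prints can then be passed to =condor_q -rev -machine= as in the example above.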