Our goal was to establish an HTCondor pool with 10,000 EC2 instances.  To conserve cash, but still test the pool at the expected scale, we requested Spot instance types in order of increasing cost (that is, starting with a single core), but configured that startd to always advertise 10 static slots.  Thus, with 10,000 instances, we would have a 100,000-slot pool.

The following configuration, excepting the explanatory comments, is identical to the one successfully started > 100,000 12-hour sleep jobs.

We used shared port on the submitter and CCB on the execute nodes to minimize port usage on the submitter.  (We used shared port on the execute nodes to simplify security group configuration.)

{section: general configuration}

I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in =/home/tlmiller/scale-test/install=.

The following was the same for all three non-AWS machines:

=--- /home/tlmiller/scale-test/install/etc/condor_config ---=
{verbatim}
RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015
{endverbatim}

The local config file was empty on all non-AWS machines:

=--- $(LOCAL_CONFIG_FILE) ---=
{verbatim}
{endverbatim}

{section: central manager}

The central manager ran a collector tree and the schedd used to for EC2 jobs.

=--- $(LOCAL_CONFIG_DIR)/00-cm ---=
{verbatim}
# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from CMS scale test set-up.  Not clear if they're right.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks.  Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16
{endverbatim}

=--- $(LOCAL_CONFIG_DIR)/01-submitter ---=
{verbatim}
DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Given that we only want 10,000 instances, this is probably larger
# than it really needs to be.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
{endverbatim}

I repeated the following section 256 times, incrementing each instance of '10000' by one each time.

=--- $(LOCAL_CONFIG_DIR)/99-collector-tree ---=
{verbatim}
# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)
# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused random delays in the root colletor.  Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000
# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log
{endverbatim}

{section: CCB}

We dedicated a host to CCB, worried that it would be too much to expect the collector tree to both collect and broker.  (That did not appear to be the case.)

=--- $(LOCAL_CONFIG_DIR)/02-ccb ---=

{verbatim}
# We're the CCB host.  We run a full collector tree, because we're lazy.
DAEMON_LIST = MASTER, COLLECTOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from CMS scale test set-up.  Probably not necessary for a CCB-only machine.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Scaling tweaks.  Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16

# Collector tree.
CONDOR_VIEW_HOST = 127.0.0.1:9999
{endverbatim}

We used the full collector-tree set-up for the CCB machine.  This certainly didn't help, but probably didn't hurt, and saved me the effort of writing a new config file.

=--- $(LOCAL_CONFIG_DIR)/99-collector-tree) ---=
{verbatim}
# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)
# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused random delays in the root colletor.  Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000
# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log
{endverbatim}

{section: submitter}

We dedicated a machine to being the submit node.  To save memory, we used the 32-bit shadow.  (We had to install an additional 32-bit compatibility library in order to run it without a wrapper script.)  Nonetheless, HTCondor required ~68 of 128 GB of RAM to run all 101,000 jobs, 4 GB of which was the schedd by itself.

=--- $(LOCAL_CONFIG_DIR)/01-submitter ---=
{verbatim}
# We're a submitter.
DAEMON_LIST = MASTER, SCHEDD
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_HOST = e141.chtc.wisc.edu:9999

# Submitter.
MAX_JOBS_RUNNING = 101000
JOB_START_DELAY = 1
JOB_START_COUNT = 100
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 1000

SCHEDD_SEND_VACATE_VIA_TCP = True
STARTD_SENDS_ALIVES = True

# This almost certainly doesn't need to be twice the SOCKET_LISTEN_BACKLOG,
# but it also almost certainly doesn't hurt for it to be larger than
# necessary.
SHARED_PORT_MAX_WORKERS = 2048

# Why do we set this?
SHADOW_WORKLIFE = 24 * $(HOUR)

# How many connections to the schedd's socket-forwarding socket do we
# allow to stack up?  (Also affects how many connections to the shared
# port daemon we allow to stack up.)  This has to be a large enough number
# to handle outbound connection burstiness -- CCB used to split the reverse
# connections over many ports, but with shared port in the picture, that
# no longer happens.
SOCKET_LISTEN_BACKLOG = 1024

# This is huge.  If this is too high, the shared port daemon will be
# forced to drop reversed CCB connections from the startds on the floor.
# Setting this to half of SOCKET_LISTEN_BACKLOG seemed to work well.
MAX_PENDING_STARTD_CONTACTS = 512

# One FD per job for talking to the shadows,
# plus one FD per job for talking to the startd/starter,
# plus 1024 FDs for miscellany.
SCHEDD_MAX_FILE_DESCRIPTORS = 201024

# This is probably massive overkill, but the shadows won't be using FDs
# in the shared port daemon.
SHARED_PORT_MAX_FILE_DESCRIPTORS = 101024

# Defaults to 100,000; I wanted a few more in the queue just to show
# we could.
MAX_JOBS_PER_OWNER = 200000
{endverbatim}

{section: execute node}

I started from the _condor_annex_ base image, removed HTCondor, and installed the version under test from RPMs.  I did not change =/etc/condor/condor_config=.

=--- /etc/condor/config.d/annex ---=
{verbatim}
DAEMON_LIST = MASTER, STARTD

# Probably not necessary for production.
STARTD_DEBUG = D_SECURITY D_NETWORK

SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_PASSWORD_FILE = /etc/condor/password_file
ALLOW_OWNER = $(ALLOW_OWNER) condor_pool@*/*
ALLOW_WRITE = $(ALLOW_WRITE) condor_pool@*/*

# This in effect gives the pool 15 minutes to claim this startd
# before shutting down.  This wasn't enough in a few cases, but
# that could be avoided by running more than one schedd or starting
# fewer jobs at the same time.  It also causes this instance to
# suicide if the pool runs of jobs, which was the real main point.
STARTD_NOCLAIM_SHUTDOWN = 900
# Intended to shut the master down when the startd goes away for
# more than sixty seconds.  Also shuts the master down if the
# startd hasn't been started in sixty seconds.
MASTER.DAEMON_SHUTDOWN_FAST = ( STARTD_StartTime == 0 ) && ((time() - DaemonStartTime) > 60)
# The script the shuts the instance down when HTCondor exits.
MASTER_SHUTDOWN_SHUTDOWN = /etc/condor/shutdown
# Because we can, mostly.
ENCRYPT_EXECUTE_DIRECTORY = TRUE

# Take the last quad of the instance's IP address and add it to
# the base port number to determine which child collector and CCB
# collector to use.  Note that PORTSTR is in the ClassAd language;
# the indirection is required because of the way the HTCondor
# config language parses $INT().
PORTSTR = 10000 + int(split( "$(IP_ADDRESS)", "." )[3])
PORTNO = $INT(PORTSTR)
COLLECTOR_HOST = e141.chtc.wisc.edu:$(PORTNO)
CCB_ADDRESS = e142.chtc.wisc.edu:$(PORTNO)

# Advertise the instance ID.
STARTD_ATTRS = $(STARTD_ATTRS) EC2InstanceID
NUM_CPUS = 10
CLAIM_WORKLIFE = 24 * $(HOUR)
{endverbatim}

=--- /etc/condor/shutdown ---=
{verbatim}
#!/bin/sh
/sbin/shutdown -h now
{endverbatim}

=--- /etc/rc.local ---=
{verbatim}
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local

# The above is from the stock OS installation.

# Create ec2.local, with some information that may be of interest to us.
echo EC2PublicIP = $(/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4) >> /etc/condor/config.d/ec2.local
echo EC2InstanceID = \"$(/usr/bin/curl -s http://169.254.169.254/latest/meta-data/instance-id)\" >> /etc/condor/config.d/ec2.local

# HTCondor shouldn't be started by default, but if it is, be sure to fully reconfigure it.
service condor restart

# Until #5590 becomes available, we have to do this to actually turn the HTCondor master shutdown script on.
while ! condor_set_shutdown -exec shutdown; do sleep 1; done
{endverbatim}