Our goal was to establish an HTCondor pool with 10,000 EC2 instances. To conserve cash, but still test the pool at the expected scale, we requested Spot instance types in order of increasing cost (that is, starting with a single core), but configured the startd on each instance to always advertise 10 static slots, regardless of core count. Thus, with 10,000 instances, we would have a 100,000-slot pool.
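The slot arithmetic can be checked with a short Python sketch. The function names are illustrative, and the core counts in the loop are arbitrary examples, not the instance types actually requested.

```python
# Static slots each startd advertises, taken from the text above.
SLOTS_PER_INSTANCE = 10


def pool_slots(instances, slots_per_instance=SLOTS_PER_INSTANCE):
    """Total slots in the pool when every instance advertises a fixed count."""
    return instances * slots_per_instance


def slots_per_core(cores, slots_per_instance=SLOTS_PER_INSTANCE):
    """How heavily a machine with `cores` real cores is oversubscribed."""
    return slots_per_instance / cores


total = pool_slots(10_000)
print("pool size:", total, "slots")

# Arbitrary example core counts; a single-core instance carries 10 slots per core.
for cores in (1, 2, 4, 8):
    print(cores, "core(s):", slots_per_core(cores), "slots per core")
```

Oversubscribing this heavily is only reasonable because the test jobs sleep; real workloads would contend for the cores.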

The following configuration, excepting the explanatory comments, is identical to the one that successfully started 100,000 12-hour sleep jobs.

general configuration

I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in /home/tlmiller/scale-test/install.

The following was the same for all three non-AWS machines:

--- /home/tlmiller/scale-test/install/etc/condor_config ---

RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015

The local config file was empty on all non-AWS machines:

--- $(LOCAL_CONFIG_FILE) ---

central manager

The central manager ran a collector tree and the schedd used for EC2 jobs.

--- $(LOCAL_CONFIG_DIR)/00-cm ---

# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# No idea where these numbers came from.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks.  Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16
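Several of the settings above are built from other macros with `$(NAME)` references. The Python sketch below shows how such references resolve. It is a deliberately simplified stand-in for HTCondor's own parser (no conditionals, includes, or self-referencing assignments), and the `FULL_HOSTNAME` value is a placeholder, since HTCondor fills that macro in on each machine.

```python
import re

# Matches a $(NAME) reference inside a macro value.
MACRO_REF = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def parse_config(text, predefined=None):
    """Collect NAME = value pairs, skipping blank lines and # comments."""
    macros = dict(predefined or {})
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            macros[name.strip()] = value.strip()
    return macros


def expand(name, macros, depth=0):
    """Expand $(NAME) references recursively, at lookup time."""
    if depth > 20:
        raise ValueError("macro nesting too deep (possible cycle): " + name)
    value = macros.get(name, "")  # undefined macros expand to nothing here
    return MACRO_REF.sub(lambda m: expand(m.group(1), macros, depth + 1), value)


config_text = """
LOCAL_DIR = /home/tlmiller/scale-test/install/local
CONDOR_HOST = $(FULL_HOSTNAME)
# Primary collector.
COLLECTOR_HOST = $(CONDOR_HOST):9999
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
"""

# Placeholder hostname, not one of the machines used in the test.
macros = parse_config(config_text, {"FULL_HOSTNAME": "cm.example.org"})
collector_host = expand("COLLECTOR_HOST", macros)
password_file = expand("SEC_PASSWORD_FILE", macros)
print(collector_host)
print(password_file)
```

On a live installation, `condor_config_val COLLECTOR_HOST` reports the value HTCondor itself computes.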

--- $(LOCAL_CONFIG_DIR)/01-submitter ---

DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Given that we only want 10,000 instances, this is probably larger
# than it really needs to be.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
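To get a feel for the throttle and lifetime settings above, the following Python sketch works out what they imply if the schedd were starting and stopping jobs itself. It assumes the throttles act as "at most COUNT jobs per batch, with DELAY seconds between batches" and that `$(HOUR)` expands to 3600, as in the stock configuration; the function name is made up for illustration.

```python
import math

HOUR = 3600  # assumed expansion of $(HOUR), in seconds


def min_ramp_seconds(jobs, batch_size, delay_seconds):
    """Lower bound on the time to process `jobs` in throttled batches.

    At most `batch_size` jobs go per batch, and consecutive batches are
    separated by `delay_seconds`, so only the gaps between batches count.
    """
    if jobs <= 0:
        return 0
    batches = math.ceil(jobs / batch_size)
    return (batches - 1) * delay_seconds


max_jobs_running = 20000

# JOB_START_COUNT = 50, JOB_START_DELAY = 2
start_ramp = min_ramp_seconds(max_jobs_running, 50, 2)
# JOB_STOP_COUNT = 30, JOB_STOP_DELAY = 1
stop_ramp = min_ramp_seconds(max_jobs_running, 30, 1)
# SHADOW_WORKLIFE = 24 * $(HOUR)
shadow_worklife = 24 * HOUR

print("start ramp (s):", start_ramp)
print("stop ramp (s):", stop_ramp)
print("shadow worklife (s):", shadow_worklife)
```

Under those assumptions, reaching MAX_JOBS_RUNNING would take a little over 13 minutes of start throttling. As the comments in the file note, this is likely moot for a schedd that only manages EC2 jobs.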