HTCondorWiki: Chtc Scale Test Setup

Our goal was to establish an HTCondor pool with 10,000 EC2 instances. To conserve cash while still testing the pool at the expected scale, we requested Spot instance types in order of increasing cost (that is, starting with single-core instances), but configured each startd to always advertise 10 static slots. Thus, with 10,000 instances, we would have a 100,000-slot pool.
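
As a quick check of that arithmetic, here is a minimal Python sketch. This page does not show the AWS-side startd configuration, so the fragment the sketch renders is an assumption: NUM_CPUS and NUM_SLOTS are real HTCondor knobs, but whether the test set them this way is a guess. The helper names pool_slots and startd_fragment are hypothetical.

```python
INSTANCES = 10_000        # target number of EC2 instances (from the text)
SLOTS_PER_INSTANCE = 10   # static slots each startd advertises (from the text)


def pool_slots(instances: int, slots_per_instance: int) -> int:
    """Total slots the pool would hold if every instance joins."""
    return instances * slots_per_instance


def startd_fragment(slots: int) -> str:
    """Render a hypothetical startd config fragment (assumption, not from this page)."""
    return f"NUM_CPUS = {slots}\nNUM_SLOTS = {slots}\n"


print(pool_slots(INSTANCES, SLOTS_PER_INSTANCE))
print(startd_fragment(SLOTS_PER_INSTANCE), end="")
```

Overriding the detected core count is one plausible way to make a single-core instance look like a 10-slot machine; since the jobs only sleep, oversubscribing the core would not matter for a scale test.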

The following configuration, excepting the explanatory comments, is identical to the one that successfully started 100,000 12-hour sleep jobs.

general configuration

I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in /home/tlmiller/scale-test/install.

The following was the same for all three non-AWS machines:

--- /home/tlmiller/scale-test/install/etc/condor_config ---

RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015

The local config file was empty on all non-AWS machines:

--- $(LOCAL_CONFIG_FILE) ---

central manager

The central manager ran a collector tree and the schedd used for EC2 jobs.

--- $(LOCAL_CONFIG_DIR)/00-cm ---

# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from the CMS scale test set-up.  Not clear if they're right.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks.  Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16

--- $(LOCAL_CONFIG_DIR)/01-submitter ---

DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Given that we only want 10,000 instances, this is probably larger
# than it really needs to be.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
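
The files in LOCAL_CONFIG_DIR are read in lexical order, which is why 01-submitter can append SCHEDD to the DAEMON_LIST set in 00-cm. The Python sketch below imitates that behaviour for simple self-referencing assignments only. It is a simplification written for illustration, not HTCondor's actual parser, and apply_assignments is a hypothetical helper.

```python
def apply_assignments(files):
    """Apply `NAME = value` lines file by file, in lexical filename order."""
    config = {}
    for name in sorted(files):
        for line in files[name].splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            # A self-reference such as $(DAEMON_LIST) takes the value seen so far.
            previous = config.get(key, "")
            config[key] = value.replace(f"$({key})", previous)
    return config


# Deliberately listed out of order; sorted() restores the lexical order.
files = {
    "01-submitter": "DAEMON_LIST = $(DAEMON_LIST), SCHEDD",
    "00-cm": "DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR",
}
print(apply_assignments(files)["DAEMON_LIST"])
# prints: MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
```

The numeric filename prefixes (00-, 01-, 99-) therefore control which file gets to extend DAEMON_LIST last.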

We repeated the following section 256 times, once per child collector, each time with a different port number in place of 10000.

--- $(LOCAL_CONFIG_DIR)/99-collector-tree ---

# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)
# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused random delays in the root collector.  Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000
# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log
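
Writing 256 copies of that section by hand is tedious, so a small generator helps. The Python sketch below reproduces the section above for a range of ports. It assumes the child collectors used consecutive ports starting at the base port (10000 through 10255); this page does not state how the ports were actually chosen. The function name collector_tree is hypothetical.

```python
# One child-collector section, with the port number as the only variable.
TEMPLATE = """\
COLLECTOR{port} = $(COLLECTOR)
COLLECTOR{port}_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector{port}Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR{port}_ARGS = -f -p {port}
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR{port}
COLLECTOR{port}_LOG = $(LOG)/{port}Log
"""


def collector_tree(base_port: int = 10000, count: int = 256) -> str:
    """Return `count` sections on consecutive ports (an assumption)."""
    return "\n".join(TEMPLATE.format(port=base_port + i) for i in range(count))


config_text = collector_tree()
# Show only the first line rather than all 256 sections.
print(config_text.splitlines()[0])
```

Writing config_text to the 99-collector-tree file in LOCAL_CONFIG_DIR would give the master one extra collector daemon per section.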

CCB

We dedicated a host to CCB, worried that it would be too much to expect the collector tree to both collect and broker. (That did not appear to be the case.)

--- $(LOCAL_CONFIG_DIR)/02-ccb ---

# We're the CCB host.  We run a full collector tree, because we're lazy.
DAEMON_LIST = MASTER, COLLECTOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from the CMS scale test set-up.  Probably not necessary for a CCB-only machine.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Scaling tweaks.  Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16

# Collector tree.
CONDOR_VIEW_HOST = 127.0.0.1:9999

We used the full collector-tree set-up for the CCB machine. This certainly didn't help, but probably didn't hurt, and saved me the effort of writing a new config file.

--- $(LOCAL_CONFIG_DIR)/99-collector-tree ---

# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)
# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused random delays in the root collector.  Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000
# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log