Our goal was to establish an HTCondor pool with 10,000 EC2 instances. To conserve cash while still testing the pool at the expected scale, we requested Spot instance types in order of increasing cost (that is, starting with a single core), but configured the startd to always advertise 10 static slots. Thus, with 10,000 instances, we would have a 100,000-slot pool. The following configuration, excepting the explanatory comments, is identical to the one that successfully started 100,000 12-hour sleep jobs.

{section: general configuration}

I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in =/home/tlmiller/scale-test/install=. The following was the same for all three non-AWS machines:

=--- /home/tlmiller/scale-test/install/etc/condor_config=
{verbatim}
RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015
{endverbatim}

The local config file was empty on all non-AWS machines:

=--- $(LOCAL_CONFIG_FILE)=
{verbatim}
{endverbatim}

{section: central manager}

The central manager ran a collector tree and the schedd used for EC2 jobs.

=--- $(LOCAL_CONFIG_DIR)/00-cm=
{verbatim}
# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# No idea where these numbers came from.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks. Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16
{endverbatim}

=--- $(LOCAL_CONFIG_DIR)/01-submitter=
{verbatim}
DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Don't remember why this isn't 10,000.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
{endverbatim}
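
As a sanity check on the numbers above, here is a minimal Python sketch (not part of the original test) of two bits of arithmetic: the pool size implied by the introduction, and what the =JOB_START_COUNT= / =JOB_START_DELAY= throttle would imply if it did apply to these jobs. The function names are hypothetical. The throttle estimate is only a lower bound on the delay the throttle itself adds, and as the config comment notes, it is probably irrelevant for an EC2-only submitter.

```python
import math

# Values copied from the text and the 01-submitter config above.
INSTANCES = 10_000        # target number of EC2 instances
SLOTS_PER_INSTANCE = 10   # static slots each startd advertises
JOB_START_COUNT = 50      # jobs the schedd starts per batch
JOB_START_DELAY = 2       # seconds the schedd waits between batches


def pool_slots(instances: int, slots_per_instance: int) -> int:
    """Total slots in the pool (hypothetical helper)."""
    return instances * slots_per_instance


def min_throttle_seconds(jobs: int, count: int, delay: int) -> int:
    """Lower bound on the seconds the start throttle alone adds.

    Assumes jobs start in batches of `count` with `delay` seconds
    between consecutive batches and no wait before the first batch.
    Ignores negotiation, matchmaking, and everything else.
    """
    batches = math.ceil(jobs / count)
    return max(batches - 1, 0) * delay


if __name__ == "__main__":
    print(pool_slots(INSTANCES, SLOTS_PER_INSTANCE))
    print(min_throttle_seconds(INSTANCES, JOB_START_COUNT, JOB_START_DELAY))
```

With the values above, this gives a 100,000-slot pool, and the throttle by itself would add at least 398 seconds to starting 10,000 jobs.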