The following configuration, excepting the explanatory comments, is identical to the one that successfully started 100,000 12-hour sleep jobs.
general configuration
I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in /home/tlmiller/scale-test/install.
The following was the same for all three non-AWS machines:
--- /home/tlmiller/scale-test/install/etc/condor_config ---
RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015
The local config file was empty on all non-AWS machines:
--- $(LOCAL_CONFIG_FILE) ---
central manager
The central manager ran a collector tree and the schedd used for EC2 jobs.
--- $(LOCAL_CONFIG_DIR)/00-cm ---
# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from CMS scale test set-up. Not clear if they're right.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks. Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16
--- $(LOCAL_CONFIG_DIR)/01-submitter ---
DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Given that we only want 10,000 instances, this is probably larger
# than it really needs to be.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
We repeated the following section 256 times.
--- $(LOCAL_CONFIG_DIR)/99-collector-tree ---
# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)

# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused
# random delays in the root collector. Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000

# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log
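Writing out 256 copies of that section by hand is tedious, so here is a minimal sketch of how the file could be generated. It assumes the repeated sections differ only in the port number and that the ports count up consecutively from the base port (10000 through 10255). The page only gives the base port and the repetition count, so the numbering scheme is a guess. The names `collector_stanza` and `collector_tree` are made up for this illustration, and the explanatory comments are left out of the generated text.

```python
# Sketch: generate the repeated child-collector stanzas.
# ASSUMPTION: ports are consecutive, starting at the base port 10000.
BASE_PORT = 10000
COUNT = 256
ROOT_COLLECTOR = "127.0.0.1:9999"

# One stanza, copied from the 99-collector-tree section, with the port
# number and root collector address turned into placeholders.
TEMPLATE = (
    "COLLECTOR{port} = $(COLLECTOR)\n"
    'COLLECTOR{port}_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector{port}Log '
    '_CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST={root} "\n'
    "COLLECTOR{port}_ARGS = -f -p {port}\n"
    "DAEMON_LIST = $(DAEMON_LIST), COLLECTOR{port}\n"
    "COLLECTOR{port}_LOG = $(LOG)/{port}Log\n"
)


def collector_stanza(port, root=ROOT_COLLECTOR):
    """Return the config stanza for one child collector on the given port."""
    return TEMPLATE.format(port=port, root=root)


def collector_tree(base_port=BASE_PORT, count=COUNT, root=ROOT_COLLECTOR):
    """Return `count` stanzas, one per consecutive port, separated by blank lines."""
    return "\n".join(collector_stanza(base_port + i, root) for i in range(count))


if __name__ == "__main__":
    # Redirect stdout into $(LOCAL_CONFIG_DIR)/99-collector-tree.
    print(collector_tree(), end="")
```

Running the script and redirecting its output into the config directory would produce one file containing all 256 stanzas.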
CCB
We dedicated a host to CCB, worried that it would be too much to expect the collector tree to both collect and broker. (That did not appear to be the case.)
--- $(LOCAL_CONFIG_DIR)/02-ccb ---
# We're the CCB host. We run a full collector tree, because we're lazy.
DAEMON_LIST = MASTER, COLLECTOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# These numbers came from CMS scale test set-up. Probably not necessary for a
# CCB-only machine.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Scaling tweaks. Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16

# Collector tree.
CONDOR_VIEW_HOST = 127.0.0.1:9999
We used the full collector-tree set-up for the CCB machine. This certainly didn't help, but probably didn't hurt, and saved me the effort of writing a new config file.
--- $(LOCAL_CONFIG_DIR)/99-collector-tree ---
# We picked port 10000 as our base port completely arbitrarily.
COLLECTOR10000 = $(COLLECTOR)

# We didn't set CONDOR_VIEW_HOST in the base configuration because it caused
# random delays in the root collector. Not sure why.
COLLECTOR10000_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector10000Log _CONDOR_USE_SHARED_PORT=FALSE _CONDOR_CONDOR_VIEW_HOST=127.0.0.1:9999 "
COLLECTOR10000_ARGS = -f -p 10000
DAEMON_LIST = $(DAEMON_LIST), COLLECTOR10000

# Useless, but shuts up the master.
COLLECTOR10000_LOG = $(LOG)/10000Log