The following configuration, excepting the explanatory comments, is identical to the one that successfully started 100,000 12-hour sleep jobs.
general configuration
I installed "personal condors" from the tarball on e141, e142, and e143.chtc.wisc.edu, in /home/tlmiller/scale-test/install.
The following was the same for all three non-AWS machines:
--- /home/tlmiller/scale-test/install/etc/condor_config ---
RELEASE_DIR = /home/tlmiller/scale-test/install
LOCAL_DIR = /home/tlmiller/scale-test/install/local
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

# A lock file unique to this installation of HTCondor.
LOCK = /tmp/condor-lock.0.152549314153479

# My user ID.
CONDOR_IDS = 20015.20015
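To make the `$(LOCAL_DIR)` references above concrete, here is a minimal Python sketch of macro expansion. It is a toy model for illustration only, not HTCondor's parser; the `expand` helper and the `CONFIG` table are hypothetical names introduced here, and the values are copied from the file above.

```python
import re

# Settings copied from the condor_config shown above.
CONFIG = {
    "RELEASE_DIR": "/home/tlmiller/scale-test/install",
    "LOCAL_DIR": "/home/tlmiller/scale-test/install/local",
    "LOCAL_CONFIG_FILE": "$(LOCAL_DIR)/condor_config.local",
    "LOCAL_CONFIG_DIR": "$(LOCAL_DIR)/config",
}

# Matches a $(NAME) macro reference.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(value, config, depth=0):
    """Recursively replace $(NAME) references; unknown names are left as-is."""
    if depth > 20:
        raise ValueError("macro nesting too deep (possible cycle)")

    def substitute(match):
        name = match.group(1)
        if name not in config:
            return match.group(0)
        return expand(config[name], config, depth + 1)

    return MACRO.sub(substitute, value)


resolved = {name: expand(value, CONFIG) for name, value in CONFIG.items()}
for name in ("LOCAL_CONFIG_FILE", "LOCAL_CONFIG_DIR"):
    print(f"{name} = {resolved[name]}")
```

Under this model, the local config file resolves to /home/tlmiller/scale-test/install/local/condor_config.local and the config directory to /home/tlmiller/scale-test/install/local/config. On a real installation, `condor_config_val LOCAL_CONFIG_DIR` reports the value HTCondor itself computed.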
The local config file was empty on all non-AWS machines:
--- $(LOCAL_CONFIG_FILE) ---
central manager
The central manager ran a collector tree and the schedd used for EC2 jobs.
--- $(LOCAL_CONFIG_DIR)/00-cm ---
# We're the central manager.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
CONDOR_HOST = $(FULL_HOSTNAME)

# Shared-nothing.
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

# Security.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_PASSWORD_FILE = $(LOCAL_DIR)/password_file
ALLOW_WRITE = condor_pool@*/* $(FULL_HOSTNAME) $(IP_ADDRESS) 127.0.0.1

# Primary collector.
COLLECTOR_NAME = collector
COLLECTOR_HOST = $(CONDOR_HOST):9999

# No idea where these numbers came from.
COLLECTOR_MAX_FILE_DESCRIPTORS = 80000
SHARED_PORT_MAX_FILE_DESCRIPTORS = 80000

# Negotiating once a minute seems like a good balance between responsiveness
# and load on the schedd.
NEGOTIATOR_INTERVAL = 60

# Don't ever preempt; don't even think about preempting, because that's slow.
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# There's no need for this, and the negotiator has better things to do
# with its time.
NEGOTIATOR_INFORM_STARTD = False

# Scaling tweaks. Not known if necessary.
COLLECTOR_QUERY_WORKERS = 16
--- $(LOCAL_CONFIG_DIR)/01-submitter ---
DAEMON_LIST = $(DAEMON_LIST), SCHEDD

# Given that we only want 10,000 instances, this is probably larger
# than it really needs to be.
MAX_JOBS_RUNNING = 20000

# This section is probably irrelevant for an EC2-only submitter.
JOB_START_DELAY = 2
JOB_START_COUNT = 50
JOB_STOP_DELAY = 1
JOB_STOP_COUNT = 30

# No UDP over the WAN, please.
SCHEDD_SEND_VACATE_VIA_TCP = True

# I have no idea what this means.
STARTD_SENDS_ALIVES = True

# We probably don't need any of these for the EC2-only submitter.
SHARED_PORT_MAX_WORKERS = 1000
SHADOW_WORKLIFE = 24 * $(HOUR)
SOCKET_LISTEN_BACKLOG = 1024

# Logging!
MAX_DEFAULT_LOG = 100 Mb
EC2_GAHP_DEBUG = D_PERF_TRACE D_SUB_SECOND
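The 01-submitter file relies on two behaviors that are easy to miss: as I understand the documented behavior, files in the local config directory are read in lexicographic order, and a self-reference such as `$(DAEMON_LIST)` picks up the value defined earlier. The Python sketch below is a toy model of that layering, not HTCondor's actual parser; `FRAGMENTS` and `load` are hypothetical names, and only the `DAEMON_LIST` lines from the two files are reproduced.

```python
# Fragments keyed by file name; deliberately listed out of order to show
# that the file name, not the insertion order, decides precedence.
FRAGMENTS = {
    "01-submitter": "DAEMON_LIST = $(DAEMON_LIST), SCHEDD",
    "00-cm": "DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR",
}


def load(fragments):
    """Apply fragments in sorted file-name order and return the final table."""
    table = {}
    for filename in sorted(fragments):
        for line in fragments[filename].splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            name, value = name.strip(), value.strip()
            # A self-reference picks up whatever was defined before this line.
            previous = table.get(name, "")
            table[name] = value.replace(f"$({name})", previous)
    return table


config = load(FRAGMENTS)
daemons = [d.strip() for d in config["DAEMON_LIST"].split(",")]
print(daemons)
```

In this model the central manager ends up running MASTER, COLLECTOR, NEGOTIATOR, and SCHEDD, which matches the description of a central manager that also hosts the EC2 schedd. Renaming the files so that the submitter fragment sorts first would leave only the 00-cm definition in effect.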