Before we begin, however, a few recommendations (especially for those readers constructing new pools):
- Use different machines for the central manager and submit nodes.
- Expand slowly; this will allow you to tackle one problem at a time.
- If possible, use dummy jobs for testing – jobs which behave something like your intended workload, but which don't produce any results you care about. This allows you to throw away jobs if doing so becomes convenient, as it might if you need to reconfigure something to improve its scalability. Dummy jobs may simply be real jobs whose results you have already obtained.
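As a concrete illustration, a dummy job can be as simple as a sleep job. The submit file below is a minimal sketch; the file name, sleep duration, and queue count are arbitrary placeholders:

# dummy.sub -- a disposable test job (names and numbers are arbitrary)
universe   = vanilla
executable = /bin/sleep
arguments  = 300
log        = dummy.log
output     = dummy.$(Process).out
error      = dummy.$(Process).err
queue 100

Submit it with condor_submit dummy.sub; since the results don't matter, you can remove the jobs with condor_rm whenever that becomes convenient.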
Network Connectivity
Because Amazon provides public IP addresses to running instances, you don't need to use CCB. Instead, with the help of the following script, you can use TCP_FORWARDING_HOST, which incurs no additional load on the collector. (Arrange for this script to be run on execute-node start-up.) You will, of course, need to adjust (or create) a security group to allow inbound connections; consider using the shared port service (see below) to simplify this.
cat > /etc/condor/config.d/ec2.local <<EOF
EC2PublicIP = $(/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
EC2InstanceID = "$(/usr/bin/curl -s http://169.254.169.254/latest/meta-data/instance-id)"
STARTD_EXPRS = \$(STARTD_EXPRS) EC2PublicIP EC2InstanceID
EOF
chmod a+r /etc/condor/config.d/ec2.local
This script requires that LOCAL_CONFIG_DIR be set to /etc/condor/config.d, the default for many distributions.
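If you use the shared port service, the only inbound port your security group needs to allow is HTCondor's default port (9618). A sketch using the AWS CLI follows; the security group ID and source CIDR are placeholders you must replace with your own values:

# Allow inbound HTCondor traffic (shared port, default 9618) from your
# submit node and central manager; <sg-id> and <cidr> are placeholders.
aws ec2 authorize-security-group-ingress \
    --group-id <sg-id> \
    --protocol tcp \
    --port 9618 \
    --cidr <cidr>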
Example Configuration
For the execute node(s):
# See the "Network Connectivity" subsection. LOCAL_CONFIG_DIR = /etc/condor/config.d TCP_FORWARDING_HOST = $(EC2PublicIP) # See the "Shared Port" section. # This reduces the number of inbound ports you need to add to your # security group to 1. USE_SHARED_PORT = TRUE DAEMON_LIST = MASTER, STARTD, SHARED_PORT # See the "CCB" section. # CCB_ADDRESS = $(COLLECTOR_HOST) # See the "TCP Updates" section. # TCP updates are more expensive but more reliable over a WAN. UPDATE_COLLECTOR_WITH_TCP = TRUE # See the "Claim ID Security" section. SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE ALLOW_DAEMON = condor_pool@*, submit-side@matchsession/<submitter-ip-address>
For the submit node(s):
# See the "Shared Port" section. # This reduces the number of inbound ports you need to add to your # firewall to 1. USE_SHARED_PORT = TRUE DAEMON_LIST = MASTER, SCHEDD, SHARED_PORT # See the "Claim ID Security" section. SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE # See the "Increase File Descriptor Limits" section. SCHEDD_MAX_FILE_DESCRIPTORS = 20000
For the central manager:
# See the "Tiered Collectors" section. COLLECTOR02 = $(COLLECTOR) COLLECTOR03 = $(COLLECTOR) COLLECTOR04 = $(COLLECTOR) # These port numbers are somewhat arbitrary; edit as required. COLLECTOR02_ARGS = -f -p 10002 COLLECTOR03_ARGS = -f -p 10003 COLLECTOR04_ARGS = -f -p 10004 # The master complains if these aren't set, although they don't work. COLLECTOR02_LOG = $(LOG)/Collector02Log COLLECTOR03_LOG = $(LOG)/Collector03Log COLLECTOR04_LOG = $(LOG)/Collector04Log # Setting the log file in the environment, however, does. COLLECTOR02_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector02Log" COLLECTOR03_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector03Log" COLLECTOR04_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector04Log" DAEMON_LIST = MASTER, NEGOTIATOR, COLLECTOR, COLLECTOR02, COLLECTOR03, COLLECTOR04 CONDOR_VIEW_HOST = $(COLLECTOR_HOST) # See the "Shared Port" section. COLLECTOR.USE_SHARED_PORT = FALSE # See the "Increase File Descriptor Limits" section. COLLECTOR_MAX_FILE_DESCRIPTORS = 20000 # See the "Turn Off Match Notification" section. NEGOTIATOR_INFORM_STARTD = FALSE
Tiered Collectors
Because the collector blocks on I/O, the number of nodes to which it can promptly respond is limited by the latencies of its connections to them. Stronger security methods (pool password, Kerberos, SSL, and GSI) require more round-trips between the collector and the startd to establish trust, and may thus intensify this bottleneck.
However, an HTCondor pool can use more than a single collector: a number of secondary collectors communicate with the startds and forward the results to the primary collector, which all of the other HTCondor tools and daemons use as normal. This allows HTCondor to perform multiple I/O operations simultaneously (and overlap them with computation) by relying on the operating system to schedule the collector processes.
The recipe for this tiered-collector setup is included in the example configurations above; you may also want to read HowToConfigCollectors. However, the example execute-node configuration is incomplete: it does not specify the secondary collector to which the node should connect. You may use the solution from HowToConfigCollectors:
COLLECTOR_HOST=$RANDOM_CHOICE(collector.hostname:10002,collector.hostname:10003,collector.hostname:10004)
but this may cause confusion, as different daemons may connect to different secondary collectors. To avoid this problem, add a script like the following to the start-up sequence of the execute-node VMs:
# Pick one of the secondary collectors' ports (10002-10004) at random.
COLLECTOR_PORT=$((($RANDOM % 3) + 10002))
echo "COLLECTOR_HOST = collector.hostname:${COLLECTOR_PORT}" > /etc/condor/config.d/collector-host.local
chmod a+r /etc/condor/config.d/collector-host.local
CCB
If you're using an EC2-compatible service which doesn't provide public IPs, you may need to use CCB. (You may also need to use CCB if you don't control the firewall around your VM instances.) Using CCB increases the load on the collector, so you may need to increase the number of secondary collectors in your tiered configuration.
Using CCB in conjunction with the shared port service further reduces the port usage of HTCondor; thus, in certain circumstances, it may be useful independently of its ability to work around private IPs and firewalls. See below.
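If you do need CCB, the execute-node settings sketched below (a variation on the example configuration above) enable it alongside the shared port service. Note that TCP_FORWARDING_HOST is not set in this case, since there is no public IP to forward to:

# Execute node behind NAT or a firewall: use CCB plus the shared port service.
USE_SHARED_PORT = TRUE
DAEMON_LIST = MASTER, STARTD, SHARED_PORT
CCB_ADDRESS = $(COLLECTOR_HOST)
# Do not set TCP_FORWARDING_HOST here; there is no public address to forward to.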
Shared Port
In its default configuration, HTCondor acquires two ports on the submit node for the lifetime of each running job, one inbound and one outbound. Although this limits the maximum number of running jobs per submit node (to about 23000), in practice, the firewall tends to become a problem before then. You can reduce the requirement for inbound ports to the schedd to 1 by using the shared port service, a multiplexer.
You can eliminate the requirement for outbound ports by using CCB in conjunction with the shared port service; doing so converts outbound connections from the submit node to the execute node into inbound connections from the execute node to the shared port on the submit node. However, using CCB increases the load on the collector(s).
Additionally, using CCB in conjunction with the shared port service reduces the total socket count on the central manager, as CCB routes all communication with the execute node through its single shared port.
The shared port service may also be of benefit on the execute node when configuring its firewall.
Using the shared port service may mean that the scaling limit of this configuration is RAM on the submit node: each running job requires a shadow, which in the latest version of HTCondor, uses roughly 1200KB. (Or about 950KB for a 32-bit installation of HTCondor.)
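For example, at roughly 1200KB per shadow, 10,000 concurrently running jobs would need on the order of 12GB of RAM on the submit node for shadows alone.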
The shared port service is enabled in the example configurations above.
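As a quick sanity check (a sketch, assuming a Linux node with the ss utility installed), you can confirm that the shared port daemon is the only HTCondor process listening for inbound connections:

# Expect a single condor listener, condor_shared_port, on port 9618 (the default).
ss -tlnp | grep condor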
TCP Updates
By default, HTCondor uses UDP to maintain its pool membership list. On a LAN, doing so is cheaper and very nearly as reliable. On a WAN, it becomes much more likely that HTCondor will mistakenly think an execute node has gone away. Doing so is expensive (and confusing), so we recommend using TCP updates.
TCP updates are enabled in the example configurations above.
Claim ID Security
By default, HTCondor creates a secure session from scratch for each daemon-to-daemon connection. However, it's possible to use an existing shared secret (the claim ID) to jump-start the security negotiation in some cases, eliminating a few round-trips over the WAN.
Claim ID security is enabled in the example configurations above.
Increase File Descriptor Limits
You may need to increase the total number of kernel-level file descriptors on the submit node. Each job gets its own shadow, each of which needs dozens of file descriptors to open all of its libraries.
You may also need to increase the per-process limit on file descriptors on both the central manager (for the collectors) and the submit node (for the schedd).
To increase the total number of system-wide file descriptors, echo the new value to /proc/sys/fs/file-max. (You may cat this file to see the current value.) For most distributions, appending fs.file-max = <number> to /etc/sysctl.conf will make this change persist across reboots. A million should be more than enough.
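A sketch of the commands involved (run as root; the value is only an example):

# Check the current system-wide limit.
cat /proc/sys/fs/file-max
# Raise it immediately.
echo 1000000 > /proc/sys/fs/file-max
# Make the change persist across reboots.
echo "fs.file-max = 1000000" >> /etc/sysctl.conf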
The example configurations above, if started as root, will increase the file descriptor limits for the schedd and the collector.
Increase kernel firewall connection tracking table size
For collectors with a very large number of connections running on a Linux machine with a kernel firewall, the size of the connection tracking table in the firewall may become a bottleneck. On many systems, the limit is 64k connections. This can be raised by setting a new limit in
/proc/sys/net/netfilter/nf_conntrack_max
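For example (run as root; the value is only an illustration, so tune it to your pool's size):

# Check the current connection-tracking limit.
cat /proc/sys/net/netfilter/nf_conntrack_max
# Raise it immediately.
echo 262144 > /proc/sys/net/netfilter/nf_conntrack_max
# Make the change persist across reboots.
echo "net.netfilter.nf_conntrack_max = 262144" >> /etc/sysctl.conf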
Turn Off Match Notification
In HTCondor's default configuration, the negotiator notifies a startd when a job matches it. This can slow job dispatch for the entire pool if the startd has vanished but HTCondor hasn't noticed yet, because the notification blocks until it times out. For pools using spot instances (which may frequently vanish without warning), we recommend turning off match notification.
Match notification is disabled in the example configurations above.
Manage Disk Space
HTCondor itself doesn't use much disk, but it's easy to run out when running thousands of jobs. Running out of disk space obviously causes HTCondor problems, but the problem is usually clearly reported in the logs.
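A simple way to keep an eye on the relevant partitions is to ask HTCondor where they are (a sketch; condor_config_val is part of HTCondor, and the directories depend on your configuration):

# Report free space on the directories HTCondor writes to.
df -h "$(condor_config_val SPOOL)" "$(condor_config_val EXECUTE)" "$(condor_config_val LOG)"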
Other Thoughts
- Adding an execute node to the pool requires the brief but full attention of one of your secondary collectors, so there's a limit to how many nodes you can add at the same time without the congestion causing the existing nodes grief. That is, the advice above about expanding slowly applies even if you've shown that your current configuration will work for the size to which you're growing the pool.
- Likewise, if you're using spot instances to build your pool, consider using a less aggressive bidding strategy to minimize the amount of execute-node churn. Adding a node is expensive, and a terminated node's leases on various resources may not expire for quite some time (by default, 15 minutes).
- Like execute node churn, but to a lesser degree, job churn may also cause your pool grief.