This document describes an HTCondor configuration known to scale to 5000 EC2 nodes. (It may also assist in constructing other pools across a wide-area, or otherwise high-latency, network.) We present the configuration as a series of scalability problems and their solutions, in order of severity. This document generally refers to the 7.8 and 7.9 release series.

Before we begin, however, a few recommendations (especially for those readers constructing new pools):

*: Use different machines for the central manager and submit nodes.

*: Expand slowly; this will allow you to tackle one problem at a time.

*: If possible, use dummy jobs for testing: jobs which behave something like your intended workload, but which don't produce any results you care about. This allows you to throw away jobs if doing so becomes convenient, as it might if you need to reconfigure something to improve its scalability. Dummy jobs may simply be real jobs whose results you have already obtained.

{section: Tiered Collectors}

Because the collector blocks on I/O, the number of nodes to which it can promptly respond is limited by the latencies of its connections to them. Stronger security methods (pool password, Kerberos, SSL, and GSI) require more round-trips between the collector and the startd to establish trust, and may thus intensify this bottleneck. Using CCB (to work around firewalls or private IP addresses) also increases collector load, especially when also using stronger security methods.

However, an HTCondor pool can use more than a single collector: a number of secondary collectors communicate with the startds and forward the results to the primary collector, which all the other HTCondor tools and daemons use as normal. This allows HTCondor to perform multiple simultaneous I/O operations (and overlap them with computation) by relying on the operating system to schedule the collector processes. To implement this tiered collector setup, follow {wiki: HowToConfigCollectors this recipe}.

If you're already using the shared port service, you should disable it for the collector; it can cause multiple problems. The easiest way to do this is to turn it off entirely on the central manager (which should not be the same machine as the submit node); it may also be explicitly disabled for the collector alone by setting *COLLECTOR.USE_SHARED_PORT* to *FALSE*.

To reduce confusion, you may also want to configure each execute node so that all of its HTCondor daemons connect to the same secondary collector. (This has the added benefit that reconfiguring an execute node won't change to which collector it reports.) One way of doing this is to randomly choose the *COLLECTOR_HOST* during the boot process; see the sketch below. If you don't set it in your other configuration files, you can simply create a file in the config.d directory (specified by *LOCAL_CONFIG_DIR*, usually /etc/condor/config.d) which sets it.
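To make the last two paragraphs concrete, here is a minimal sketch of the settings involved. The hostname, port, and file name are placeholders, and the sketch assumes secondary collectors have already been configured per the recipe linked above.

{code}
## Central manager: keep the collector(s) out from behind the shared port.
## Either turn shared port off entirely on this machine ...
USE_SHARED_PORT = FALSE
## ... or disable it for the collector alone:
# COLLECTOR.USE_SHARED_PORT = FALSE

## Execute node: a file such as /etc/condor/config.d/99-collector, written
## once at boot, pinning this node to one randomly chosen secondary
## collector.  The hostname and port are placeholders.
COLLECTOR_HOST = cm.example.com:9701
{endcode}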
{section: Shared Port Service}

In its default configuration, HTCondor acquires two ports on the submit node for the lifetime of each running job, one inbound and one outbound. Although this limits the maximum number of running jobs per submit node (to about 23000), in practice the firewall tends to become a problem before then. You can reduce the requirement for inbound ports on the schedd to one by using the shared port service, a connection multiplexer.

You can eliminate the requirement for outbound ports by using CCB in conjunction with the shared port service; doing so converts outbound connections from the submit node to the execute node into inbound connections from the execute node to the shared port on the submit node. However, using CCB increases the load on the collector(s). Additionally, using CCB in conjunction with the shared port service reduces the total number of sockets on the central manager, as CCB routes all communication with the execute node through its single shared port.

Using the shared port service may mean that the scaling limit of this configuration is RAM on the submit node: each running job requires a shadow, which, in the latest version of HTCondor, uses roughly 1200KB (or about 950KB for a 32-bit installation of HTCondor).

To use the shared port service, set *USE_SHARED_PORT* to *TRUE* on the submit and execute nodes. You should not use the shared port service on the central manager; this can cause multiple problems. You must also add the shared port service to the *DAEMON_LIST*, for instance by setting it to *$(DAEMON_LIST), SHARED_PORT*.

To enable CCB, set *CCB_ADDRESS* to *$(COLLECTOR_HOST)* on the execute nodes. Do not enable CCB on the central manager (for the collector); this can cause multiple problems, especially in conjunction with the shared port service.

{section: TCP Updates}

By default, HTCondor uses UDP to maintain its pool membership list. On a LAN, doing so is cheaper and very nearly as reliable. On a WAN, it becomes much more likely that HTCondor will mistakenly conclude that an execute node has gone away, which is expensive (and confusing), so we recommend using TCP updates. To enable TCP updates, set *UPDATE_COLLECTOR_WITH_TCP* to *TRUE* on the execute nodes.

{section: Claim ID Security}

By default, HTCondor creates a secure session from scratch for each daemon-to-daemon connection. However, it's possible to use an existing shared secret (the claim ID) to jump-start the security negotiation in some cases, eliminating a few round-trips over the WAN. To use claim IDs for security, set *SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION* to *TRUE* for the submit and execute nodes. On the execute node, you must also set *ALLOW_DAEMON* to include *submit-side@matchsession/x.y.z.w*, where *x.y.z.w* is the IP address of your submit node.

{section: Increase File Descriptor Limits}

You may need to increase the total number of kernel-level file descriptors on the submit node. Each job gets its own shadow, and each shadow needs dozens of file descriptors to open all of its libraries. You may also need to increase the per-process limit on file descriptors on both the central manager (for the collectors) and the submit node (for the schedd).

To increase the total number of system-wide file descriptors, echo the new value to _/proc/sys/fs/file-max_. (You may cat this file for the current value.) For most distributions, appending a line of the form *fs.file-max = 1000000* to _/etc/sysctl.conf_ will make this change persist across reboots; a million should be more than enough.

If HTCondor is started as root, setting *SCHEDD_MAX_FILE_DESCRIPTORS* and *COLLECTOR_MAX_FILE_DESCRIPTORS* can increase how many file descriptors those HTCondor daemons may use. Twenty thousand should suffice.
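Gathering the knobs from the preceding four sections into one place, a minimal sketch might look like the following. The comments note which machine each setting belongs on; *x.y.z.w* is a placeholder for your submit node's IP address, and the numeric values are simply the suggestions from the text above.

{code}
## Submit and execute nodes: multiplex inbound connections over a single port.
USE_SHARED_PORT = TRUE
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT

## Execute nodes: convert connections to them into connections from them,
## routed via the collector (CCB).
CCB_ADDRESS = $(COLLECTOR_HOST)

## Execute nodes: update the collector over TCP rather than UDP.
UPDATE_COLLECTOR_WITH_TCP = TRUE

## Submit and execute nodes: reuse the claim ID to shortcut security negotiation.
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
## Execute nodes only; x.y.z.w is the submit node's IP address (placeholder).
ALLOW_DAEMON = $(ALLOW_DAEMON) submit-side@matchsession/x.y.z.w

## Submit node (for the schedd) and central manager (for the collectors),
## when HTCondor is started as root: raise the daemons' descriptor limits.
SCHEDD_MAX_FILE_DESCRIPTORS = 20000
COLLECTOR_MAX_FILE_DESCRIPTORS = 20000
{endcode}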
{section: Turn Off Match Notification}

In HTCondor's default configuration, the negotiator notifies a startd when a job matches it. This can slow job dispatch for the entire pool if the startd has vanished but HTCondor hasn't noticed yet, because the notification blocks until it times out. For pools using spot instances (which may frequently vanish without warning), we recommend turning off match notification. To turn off match notification, set *NEGOTIATOR_INFORM_STARTD* to *FALSE* on the central manager (see the brief sketch at the end of this page).

{section: Manage Disk Space}

HTCondor itself doesn't use much disk, but it's easy to run out when running thousands of jobs. Running out obviously causes HTCondor problems, but this problem is usually clearly reported in the logs.

{section: Other Thoughts}

*: Adding an execute node to the pool requires the brief but full attention of one of your secondary collectors, so there's a limit to how many nodes you can add at the same time without the congestion causing the existing nodes grief. That is, the advice about “growing slowly” above applies even if you've shown that your current configuration will work for the size to which you're growing the pool.

*: Likewise, if you're using spot instances to build your pool, consider using a less aggressive bidding strategy to minimize the amount of execute node churn. Adding a node is expensive, and a terminated node's leases on various resources may not expire for quite some time (by default, 15 minutes).

*: Like execute node churn, but to a lesser degree, job churn may also cause your pool grief.
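For completeness, the central-manager setting referenced in the match-notification section above is just the single line below.

{code}
## Central manager: don't block waiting to tell a (possibly vanished) startd
## about a match.
NEGOTIATOR_INFORM_STARTD = FALSE
{endcode}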