{section: How to manage a large HTCondor pool}
 
Known to work with HTCondor version: 7.0
 
HTCondor can handle pools of tens of thousands of execution slots and job queues of hundreds of thousands of jobs.  Depending on how you deploy it, the workloads that run on it, and the other tasks that share the system with it, you may find that HTCondor's ability to keep up is limited by memory, processing speed, disk bandwidth, or configurable limits.  The following information should help you determine whether there is a problem, which component is suffering, and what you might be able to do about it.
 
{subsection: Basic Guidelines for Large HTCondor Pools}
 
1.   Upgrade to HTCondor 7 if you are still running an older version; it contains many scalability improvements.  It is also a good idea to update your configuration based on the defaults that ship with HTCondor 7, because they include updated settings that improve scalability.
 
2.   Put the central manager (collector + negotiator) on a machine with sufficient memory and two CPUs/cores primarily dedicated to this service.
 
3.   If convenient, take advantage of the fact that you can use multiple submit machines.  At a minimum, dedicate one machine as a submit machine with few or no other duties.
 
4.   Under UNIX, increase the number of jobs that a schedd will run simultaneously if you have enough disk bandwidth and memory (see the rough estimates in the next section).  As of HTCondor 7.4.0, the default setting for MAX_JOBS_RUNNING is a formula that scales with the amount of available memory.  In prior versions, the default was just 200.
 
 {code}
 MAX_JOBS_RUNNING = 2000
 
 7.   If running a lot of big standard universe jobs, set up multiple checkpoint servers, rather than doing all checkpointing onto the submit node.
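
If you know where your checkpoint servers will live, a minimal sketch of pointing a group of execute machines at a nearby checkpoint server might look like this (the hostname is hypothetical; the checkpoint server chapter of the manual describes the full setup):

{code}
# On the execute machines served by this checkpoint server
USE_CKPT_SERVER = True
CKPT_SERVER_HOST = ckpt1.example.com
# Checkpoint to the server configured on the execute machine,
# not the one configured on the submit machine
STARTER_CHOOSES_CKPT_SERVER = True
{endcode}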
 
8.   If you are not using strong security (i.e. just host IP authorization) in your HTCondor pool, then you can turn off security negotiation to reduce overhead:
 
 {code}
 SEC_DEFAULT_NEGOTIATION = OPTIONAL
 
*: The schedd requires a minimum of ~10 KB of RAM per job in the job queue.  For jobs with huge environment values or other big ClassAd attributes, the requirements are larger.
 
*: Each running job has a condor_shadow process, which requires an additional ~500 KB of RAM.  (Disclaimer: we have some reports that in different environments/configurations, this requirement can be inflated by a factor of 2.)  32-bit Linux may run out of kernel memory even if there is free "high" memory available.  In our experience, with HTCondor 7.3.0, a 32-bit dedicated submit machine cannot run more than 10,000 jobs simultaneously because of kernel memory constraints.
 
*: Each vanilla job requires two, occasionally three, network ports on the submit machine.  Standard universe jobs require 5.  On 2.6 Linux kernels, the ephemeral port range is typically 32768 through 61000, so from a single submit machine this limits you to about 14,000 simultaneously running vanilla jobs.  In Linux, you can increase the ephemeral port range via /proc/sys/net/ipv4/ip_local_port_range (a sysctl example follows this list).  Note that short-running jobs may require more ports, because a non-negligible number of ports will be consumed in the temporary TIME_WAIT state.  For this reason, the HTCondor manual conservatively recommends 5 * running jobs.  Fortunately, as of HTCondor 7.5.3, the TIME_WAIT issue with short-running jobs is largely gone, due to SHADOW_WORKLIFE.  Also, as of HTCondor 7.5.0, condor_shared_port can be used to reduce port usage even further.  Port usage per running job is negligible if CCB is used to access the execute nodes; otherwise it is 1 (outgoing) port per job.
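
On Linux, for example, you could roughly double the number of usable ephemeral ports with sysctl (the exact bounds are a judgment call; put the setting in /etc/sysctl.conf to make it persistent):

{code}
sysctl -w net.ipv4.ip_local_port_range="10000 65535"
{endcode}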
 
 Example calculations:
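
As an illustration using the per-job estimates above (the pool size here is hypothetical, and actual numbers vary with ClassAd size and configuration):

{code}
Dedicated submit machine with 100,000 queued jobs, 10,000 of them running (vanilla universe):

  Schedd job queue:  100,000 jobs x ~10 KB  = ~1 GB of RAM
  Shadows:            10,000 jobs x ~500 KB = ~5 GB of RAM (possibly 2x, per the disclaimer above)
  Network ports:      10,000 jobs x 2-3     = 20,000-30,000 ports, which can exceed the
                      ~28,000 ports in the default 32768-61000 ephemeral range, so widen
                      the range or use CCB / condor_shared_port
{endcode}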
 
 
 {subsection: Monitoring Health of the Collector}
 
If the collector can't keep up with the ClassAd updates that it is receiving from the HTCondor daemons in the pool, and you are using UDP updates (the default), then it will "drop" updates.  The consequence of dropped updates is stale information about the state of the pool and possibly machines appearing to be missing from the pool (depending on how many successive updates are lost).  If you are using TCP updates and the collector cannot keep up, then HTCondor daemons (e.g. startds) may block or time out when trying to send updates.
 
A simple way to see if you have a serious problem with dropped updates is to observe the total number of machines in the pool, from the point of view of the collector ({code}condor_status -total{endcode}).  If this number drops below what it should be, and the missing machines are running HTCondor and otherwise working fine, then the problem may be dropped updates.
 
A more direct way to see if your collector is dropping ClassAd updates is to use the tool {link: http://www.cs.wisc.edu/condor/manual/v7.0/condor_updates_stats.html condor_updates_stats}.
 
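One way to invoke it is to pipe the long-form ClassAd of a machine through the tool (the hostname below is hypothetical; this relies on COLLECTOR_DAEMON_STATS, which is enabled by default):

{code}
condor_status -l some-startd.example.com | condor_updates_stats
{endcode}

The output summarizes, for that machine, how many updates the collector received and how many were lost.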
 
If your problem is simply that UDP updates are coming in uneven bursts, then the solution is to provide enough UDP buffer space.  You can see whether this is the problem by watching the receive queue on the collector's UDP port (visible through {code}netstat -l{endcode} under UNIX).  If it fills up now and then but is otherwise empty, then increasing the buffer size should help.  However, the default in current versions of HTCondor is 10MB, which is adequate for most large pools that we have seen.  Example:
 
 {code}
 # 20MB
 COLLECTOR_SOCKET_BUFSIZE = 20480000
 {endcode}
 
See the HTCondor Manual entry for {link: http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14586 COLLECTOR_SOCKET_BUFSIZE} for additional information on how to make sure the OS is cooperating with the requested buffer size.
 
 If you are using strong authentication in the updates to the collector, this may add a lot of overhead and cause the collector not to scale high enough for very large pools.  One way to deal with that is to have multiple collectors that each serve a portion of your execute nodes.  These collectors would receive updates via strong authentication and then forward the updates to another main collector.  An example of how to set this up is described in {wiki: HowToConfigMulti-TierCollectors How to configure multi-tier collectors}.
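
One way to achieve the forwarding is the collector's CONDOR_VIEW_HOST setting, which tells a collector to forward the ads it receives to another collector.  A rough sketch, with hypothetical hostnames (the linked page gives the complete recipe, including how to point execute nodes at their intermediate collector):

{code}
# Configuration on each intermediate collector
CONDOR_VIEW_HOST = main-collector.example.com
{endcode}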
 
 
 {subsection: High Availability}
 
You can set up redundant collector+negotiator instances, so that if the central manager machine goes down, the pool can continue to function.  All of the HAD collectors run all the time, but only one negotiator may run at a time; the condor_had daemon ensures that a new negotiator instance is started when the existing one dies.  The main restriction is that the HAD negotiator won't help users who are flocking to the HTCondor pool.  More information about HAD can be found in the {link: http://www.cs.wisc.edu/condor/manual/v7.0/3_10High_Availability.html HTCondor manual}.
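
A heavily abbreviated sketch of the shared configuration for two redundant central managers follows (hostnames and port are hypothetical; the manual chapter linked above covers replication, timeouts, state files, and the other required settings):

{code}
CENTRAL_MANAGER1 = cm1.example.com
CENTRAL_MANAGER2 = cm2.example.com
CONDOR_HOST = $(CENTRAL_MANAGER1), $(CENTRAL_MANAGER2)

# Both central managers run a collector and condor_had; condor_had
# decides which of them runs the single active negotiator
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD
HAD_PORT = 51450
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
HAD_USE_PRIMARY = True
MASTER_NEGOTIATOR_CONTROLLER = HAD
{endcode}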
 
 Tip: if you do frequent condor_status queries for monitoring, you can direct these to one of your secondary collectors in order to offload work from your primary collector.
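
For example, assuming a hypothetical secondary collector host:

{code}
condor_status -pool secondary-collector.example.com -total
{endcode}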