Known to work in Condor version: 7.0.1
 
-This is a technique for increasing the scalability of the Condor collector.  This has been found to help scale up glidein pools using GSI authentication in order to scale beyond ~5000 slots to ~12000 slots.  Other strong authentication methods are similarly CPU intensive, so they should also benefit from this technique.  The reason why this is particularly relevant to glidein pools is that these pools typically have shorter lived startds than dedicated pools, so new security sessions need to be established more often.  In the case mentioned where we needed to configure a multi-tier collector to scale beyond 5k slots, the glideins were restarting (unsynchronized) with an average of about 3 hour lifespans.
+This is a technique for increasing the scalability of the Condor collector.  It has been found to help glidein pools using GSI authentication scale from ~5000 slots to ~20000 slots.  Other strong authentication methods are similarly CPU intensive, so they should also benefit from this technique.  When authenticating across a wide-area network, network latency is actually a bigger problem for the collector than the CPU cost of authentication.  The multi-tier collector approach distributes both the latency and the CPU usage across an adjustable number of collectors.  This is particularly relevant to glidein pools because their startds are typically shorter lived than those in dedicated pools, so new security sessions must be established more often.  In the case mentioned above, where we needed a multi-tier collector to scale beyond 5k slots, the glideins were restarting (unsynchronized) with an average lifespan of about 3 hours.
 
-The basic idea is to have multiple collectors that each individually serve a portion of the pool.  The machine {quote: ClassAd} that are sent to this collector are forwarded to one main collector (the central manager).  The main collector is used for matchmaking purposes.  All of these collectors could exist on the same machine, in which case you would want to make sure there are multiple CPUs/cores, or they could be located on separate machines.
+The basic idea is to have multiple collectors that each serve a portion of the pool.  The machine {quote: ClassAds} sent to these collectors are forwarded to one main collector (the central manager), which is used for matchmaking.  All of these collectors could run on the same machine, in which case you would want to make sure it has sufficient CPUs/cores, or they could be located on separate machines.
 
 Assuming you are running the collectors on the same machine, you will need to assign a different network port to each of them.  The main collector can use the standard port, to keep things simple.  Here is how you could configure 3 "sub" collectors on ports 10002-10004 (arbitrarily chosen) and have them forward {quote: ClassAds} to the main collector.
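+As a sketch of what such a configuration could look like (untested here; the macro names and log file locations are assumptions, not necessarily what was used in the case described above), you can run extra collector instances under the condor_master and point them at the main collector via CONDOR_VIEW_HOST, which tells a collector where to forward its ads:
+
+{code}
+# Three additional collector instances, same binary as the main collector
+COLLECTOR2 = $(COLLECTOR)
+COLLECTOR3 = $(COLLECTOR)
+COLLECTOR4 = $(COLLECTOR)
+
+# Each sub-collector listens on its own port
+COLLECTOR2_ARGS = -f -p 10002
+COLLECTOR3_ARGS = -f -p 10003
+COLLECTOR4_ARGS = -f -p 10004
+
+# Separate log files so the instances do not clobber each other
+COLLECTOR2_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector2Log"
+COLLECTOR3_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector3Log"
+COLLECTOR4_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/Collector4Log"
+
+# Have condor_master start the sub-collectors
+DAEMON_LIST = $(DAEMON_LIST) COLLECTOR2 COLLECTOR3 COLLECTOR4
+
+# Forward ads from the sub-collectors to the main collector
+CONDOR_VIEW_HOST = $(COLLECTOR_HOST)
+{endcode}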
 
@@ -34,3 +34,5 @@
 
 
 Then you would configure a fraction of your pool (execute machines) to use each of the sub-collectors.  It is tempting to use something like {code}COLLECTOR_HOST=$RANDOM_CHOICE(collector.hostname:10002,collector.hostname:10003,collector.hostname:10004){endcode}, but I haven't tested that, so I am not 100% sure it is problem-free.  Your schedds and negotiator should be configured with COLLECTOR_HOST pointing to the main collector.
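+Concretely, the static split could look like this (hostname is a placeholder):
+
+{code}
+# On roughly one third of the execute machines: report to one sub-collector
+COLLECTOR_HOST = collector.hostname:10002
+{endcode}
+
+while the submit machines and central manager keep talking to the main collector:
+
+{code}
+# On schedd and negotiator machines: use the main collector on its standard port
+COLLECTOR_HOST = collector.hostname
+{endcode}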
+
+How many sub-collectors are required?  One data point: we have successfully used 70 sub-collectors on a single machine to support a pool of 100k GSI-authenticated daemons (masters and startds) with an average lifespan of 3 hours and network round-trip times of 0.1s (cross-Atlantic).  The daemons in the pool were also using these collectors as their CCB servers.