{section: How to load balance users to one of many submit nodes}

Known to work with 8.6

{subsection: Overview}

It is common for users of HPC systems to have a shared file system and a pool of submit servers to which users are assigned dynamically and transparently based on load. The assignment is transparent to users because all of their files are on the shared file system, so in many cases they need not know which machine is running their job.

This usage model can be approximated with HTCondor, but there are some differences that should be noted.

1: The HTCondor job queue is not shared between submit nodes, but any node can query any other node's job queue by setting configuration parameters or using command line arguments. It is also possible to query all job queues with a single command, although this is not fully transparent to the user.
1: other differences?

{subsection: Strategy}

The basic strategy is to use HTCondor's ability to customize the configuration per user, so that any given user's jobs always go to a specific schedd regardless of which submit machine the user is currently logged in to. This can be almost completely transparent to the user if all of the user's files are on a shared file system.

{subsection: Implementation}

The administrator or the user creates the file =~/.condor/user_config= in the user's home directory. That file sets the configuration variable
{snip: user_config fragment}SCHEDD_HOST=<schedd-name>{endsnip}
which causes queue management commands such as =condor_submit=, =condor_hold= and =condor_q= to be routed to the specific schedd named =<schedd-name>=.
The schedd name is the same name that appears in the =Name= column of the output of the command
{term}condor_status -schedd{endterm}{verbatim}
Name                      Machine                   RunningJobs  IdleJobs  HeldJobs
submit-3.chtc.wisc.edu    submit-3.chtc.wisc.edu           2638     47950      7870
submit-4.chtc.wisc.edu    submit-4.chtc.wisc.edu           1990     49497      8977
submit-5.chtc.wisc.edu    submit-5.chtc.wisc.edu          11586     28108      6966
testsubmit.chtc.wisc.edu  testsubmit.chtc.wisc.edu            0         0         1
{endverbatim}

So, for instance, you might configure the user *tj* to use submit-3:
{snip: fragment of /home/tj/.condor/user_config}SCHEDD_HOST=submit-3.chtc.wisc.edu{endsnip}

Now whenever user *tj* runs =condor_submit= or =condor_q=, the submit or query will go to the schedd called =submit-3.chtc.wisc.edu=.

As long as user *tj* has no jobs in any of the HTCondor schedd queues, *tj* can be moved to a new schedd merely by changing the contents of the file =/home/tj/.condor/user_config=. This can happen even while *tj* is logged in, since the file is re-parsed by each invocation of =condor_q= or =condor_submit=.

Whether or not *tj* has any jobs in any of the schedds can be determined by running
{term}condor_status -submit tj@chtc.wisc.edu -af Machine 'HeldJobs+IdleJobs+RunningJobs+LocalJobsIdle+LocalJobsRunning'{endterm}{verbatim}
submit-3.chtc.wisc.edu 3
{endverbatim}
or the equivalent command via the HTCondor Python bindings.

{subsection: Schedd failover}

If the schedd's job queue is stored on the shared file system, HTCondor can be configured so that when a given machine fails, any schedds running on it will be automatically restarted on the backup machine, but with the same schedd name.

*details to be written*
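
Returning to the job-count check in the Implementation subsection: the =condor_status -submit= command there could be approximated with the Python bindings roughly as follows. This is a minimal sketch, not a definitive implementation. It assumes the classic =htcondor= module is installed and that the default collector is reachable. The helper names (=total_jobs=, =jobs_by_schedd=, =can_be_moved=, =query_submitter_ads=) are hypothetical and chosen for illustration, and the =Name= constraint is an assumption about how the submitter ads are selected.

```python
# Attributes summed by the condor_status command shown in the
# Implementation subsection.
JOB_COUNT_ATTRS = (
    "HeldJobs",
    "IdleJobs",
    "RunningJobs",
    "LocalJobsIdle",
    "LocalJobsRunning",
)


def total_jobs(ad):
    """Sum the job counters of one submitter ad; a missing counter counts as 0."""
    return sum(int(ad.get(attr, 0)) for attr in JOB_COUNT_ATTRS)


def jobs_by_schedd(ads):
    """Map each schedd machine to the user's total job count on it."""
    counts = {}
    for ad in ads:
        machine = ad.get("Machine", "unknown")
        counts[machine] = counts.get(machine, 0) + total_jobs(ad)
    return counts


def can_be_moved(ads):
    """True when the user has no jobs in any of the reported schedds."""
    return all(count == 0 for count in jobs_by_schedd(ads).values())


def query_submitter_ads(submitter):
    """Fetch the submitter ads for one user from the default collector.

    Assumption: the classic htcondor bindings are installed. The module is
    imported here so the helpers above can be used without it.
    """
    import htcondor

    collector = htcondor.Collector()
    return collector.query(
        htcondor.AdTypes.Submitter,
        constraint=f'Name == "{submitter}"',
        projection=["Name", "Machine", *JOB_COUNT_ATTRS],
    )
```

With a reachable pool, =jobs_by_schedd(query_submitter_ads("tj@chtc.wisc.edu"))= would give a per-schedd job count for *tj*, and =can_be_moved= on the same ads indicates whether the =SCHEDD_HOST= setting could be changed without leaving jobs behind in an old queue.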