How to load balance users to one of many submit nodes
Known to work with 8.6
Overview
It is common for users of HPC systems to have a shared file system and a pool of submit servers that users are assigned to dynamically and transparently based on load. The assignment is transparent because all of a user's files are on the shared file system, so in many cases users do not need to know which machine is handling their jobs.
This use model can be approximated with HTCondor, but there are some differences which should be noted.
- The HTCondor job queue is not shared between submit nodes, but any node can query any other node's job queue by setting configuration parameters or using command line arguments. It is also possible to query all job queues with a single command, although this is not fully transparent to the user.
- other differences?
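As a rough illustration of the first point, the sketch below wraps condor_q's -name option (query one named schedd) and -global option (query every job queue in the pool). The wrapper names q_on and q_all are hypothetical and exist only for this example.

```shell
# Hypothetical wrappers; only condor_q and its -name / -global options
# come from HTCondor itself.

# Show the queue of one named schedd, whichever submit node you are on.
q_on() {
    condor_q -name "$1"
}

# Show the queues of every schedd in the pool with a single command.
q_all() {
    condor_q -global
}
```

For example, q_on submit-4.chtc.wisc.edu lists the jobs queued on submit-4 even when run from submit-3.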
Strategy
The basic strategy is to make use of HTCondor's ability to customize the configuration per user, so that any given user's jobs always go to a specific schedd regardless of which submit machine the user is currently logged in to. This can be almost completely transparent to the user if all of the user's files are on a shared file system.
Implementation
The administrator or the user adds a setting to the user's ~/.condor/user_config file. That file must set the configuration variable SCHEDD_HOST.
user_config fragment
SCHEDD_HOST=<schedd-name>
Here <schedd-name> is the same name that you would see in the Name column of the output of the command condor_status -schedd:
Name                      Machine                   RunningJobs  IdleJobs  HeldJobs
submit-3.chtc.wisc.edu    submit-3.chtc.wisc.edu           2638     47950      7870
submit-4.chtc.wisc.edu    submit-4.chtc.wisc.edu           1990     49497      8977
submit-5.chtc.wisc.edu    submit-5.chtc.wisc.edu          11586     28108      6966
testsubmit.chtc.wisc.edu  testsubmit.chtc.wisc.edu            0         0         1
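The job counts in this output could also drive the choice of schedd for a new user. The following is a minimal sketch, not part of HTCondor: the function name least_loaded_schedd is hypothetical, and "load" is assumed to mean running plus idle jobs. It reads lines of the form "name running idle" and prints the name with the smallest total.

```shell
# Hypothetical helper: read "name running idle" lines on stdin and print
# the schedd name with the fewest running + idle jobs.
# Assumption: running + idle is a reasonable measure of load.
least_loaded_schedd() {
    awk 'NF >= 3 {
             load = $2 + $3
             if (best == "" || load < min) { min = load; best = $1 }
         }
         END { if (best != "") print best }'
}

# Possible usage on a machine with the HTCondor tools installed:
#   condor_status -schedd -af Name TotalRunningJobs TotalIdleJobs | least_loaded_schedd
```

A site would probably want to filter out test or special-purpose schedds before feeding the list to such a helper.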
So, for instance, you might configure the user tj to use submit-3:
fragment of /home/tj/.condor/user_config
SCHEDD_HOST=submit-3.chtc.wisc.edu
Now whenever user tj runs condor_submit or condor_q, the submit/query will go to the schedd called submit-3.chtc.wisc.edu.
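An administrator who assigns many users may want to script this step. The sketch below is one possible approach, not an HTCondor tool: the function name set_user_schedd is hypothetical, and it assumes the user_config file holds simple name=value lines. It replaces any existing SCHEDD_HOST line and leaves other settings in place.

```shell
# Hypothetical helper: set SCHEDD_HOST in <home>/.condor/user_config.
# Usage: set_user_schedd <home-directory> <schedd-name>
set_user_schedd() {
    home_dir=$1
    schedd=$2
    cfg="$home_dir/.condor/user_config"
    mkdir -p "$home_dir/.condor"
    if [ -f "$cfg" ]; then
        # Keep every line except an existing SCHEDD_HOST assignment.
        grep -v '^[[:space:]]*SCHEDD_HOST[[:space:]]*=' "$cfg" > "$cfg.tmp" || true
    else
        : > "$cfg.tmp"
    fi
    printf 'SCHEDD_HOST=%s\n' "$schedd" >> "$cfg.tmp"
    # Replace the old file in one step so readers never see a partial file.
    mv "$cfg.tmp" "$cfg"
}
```

For example, set_user_schedd /home/tj submit-3.chtc.wisc.edu produces the fragment shown above. When run as root, the resulting file would also need its ownership set to the user.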
As long as user tj has no jobs in any of the HTCondor schedd queues, he can be moved to a new schedd merely by changing the contents of the /home/tj/.condor/user_config file. This can happen even while tj is logged in, since the file is re-parsed by each invocation of condor_q or condor_submit.
Whether or not tj has any jobs in any of the schedds can be determined by running
condor_status -submit tj@chtc.wisc.edu -af Machine 'HeldJobs+IdleJobs+RunningJobs+LocalJobsIdle+LocalJobsRunning'
submit-3.chtc.wisc.edu 3
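This check can be turned into a guard that runs before a user is moved. The sketch below assumes the two-column "machine count" output shown above. The function name no_jobs_anywhere is hypothetical.

```shell
# Hypothetical helper: read "machine jobcount" lines on stdin.
# Succeeds (exit 0) only if no line reports a count above zero;
# empty input (no submitter ads at all) also counts as "no jobs".
no_jobs_anywhere() {
    awk '$2 + 0 > 0 { busy = 1 } END { exit (busy ? 1 : 0) }'
}

# Possible usage on a machine with the HTCondor tools installed:
#   condor_status -submit tj@chtc.wisc.edu -af Machine \
#       'HeldJobs+IdleJobs+RunningJobs+LocalJobsIdle+LocalJobsRunning' \
#     | no_jobs_anywhere && echo "tj can be moved"
```

With the sample output above (3 jobs on submit-3), the guard fails and tj should not be moved yet.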
Schedd failover
If the schedd's job queue is stored on the shared file system, HTCondor can be configured so that when a given machine fails, any schedds running on it are automatically restarted on the backup machine with the same schedd name.
details to be written