This is RUST Ticket #4081. It contains MPI Wisdom from Derek, as well as a few
other tidbits.

From: Martin Siegert
To: condor-admin@cs.wisc.edu
Subject: condor on a beowulf

Hi there,

I am in the process of setting up HTCondor as a batch queueing/scheduling
system for our Beowulf cluster. But before I can get started I already
ran into a few problems:

1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
   I was expecting that this would contain shared libraries (lib*.so).
   However, the release.tar contains only what looks like static
   libraries (lib*.a). Am I missing something?

2) I am installing HTCondor on the master node of the cluster and want to
   use it on the private network 172.16.0.0. The hostname structure is

   172.16.0.1 b001 (master node)
   172.16.0.2 b002
   172.16.0.3 b003
   ...

   Following section 3.11.9 of the manual I set

   CONDOR_HOST = b001

   but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN?
   I do use a common uid space and I NFS export /home and /usr/local
   from the master node. However, I do not have a domain name (which would
   unnecessarily complicate management of the cluster). Thus, do I set
   UID_DOMAIN and FILESYSTEM_DOMAIN to nothing (empty string)?
   Also, should I set DEFAULT_DOMAIN_NAME = '' ?

3) All nodes have two interfaces one (on the 172.16.0.0 network) is
   used for MPI communication, the second (on the 172.17.0.0 network)
   is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
   of the condor_config.local files.
   Does it create a problem with HTCondor that NFS traffic (including
   exporting of the HTCondor home directory and the HTCondor binaries and
   libraries) is actually on a different interface?

Thanks a lot for your help in advance! Also, thanks for making HTCondor available.
Regards,
Martin

===========================================================================
Date of creation: Tue Jul 16 12:41:32 2002 (1026841296)

From RUST Tue, 16 Jul 2002 14:15:20 -0600 (CST)
Subject: Actions

Assigned to ned by ned

===========================================================================
Date of actions: Tue Jul 16 14:15:20 2002 (1026846921)

Date: Tue, 16 Jul 2002 16:45:27 -0500
From: James Kirby
To: ned
Subject: Re: [condor-admin #4081] condor on a beowulf

Hello,

> I am in the process of setting up HTCondor as a batch queueing/scheduling
> system for our Beowulf cluster. But before I can get started I already
> ran into a few problems:
>
> 1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
>    I was expecting that this would contain shared libraries (lib*.so).
>    However, the release.tar contains only what looks like static
>    libraries (lib*.a). Am I missing something?

No. We statically link in all of our own libraries, hence the size of the
executables.

> 2) I am installing HTCondor on the master node of the cluster and want to
>    use it on the private network 172.16.0.0. The hostname structure is
>
>    172.16.0.1 b001 (master node)
>    172.16.0.2 b002
>    172.16.0.3 b003
>    ...
>
>    Following section 3.11.9 of the manual I set
>
>    CONDOR_HOST = b001
>
>    but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN?

You need to "make up" a domain for your machines, like so:

172.16.0.1 b001.qux.foo.bar.com b001 (master node)
172.16.0.2 b002.qux.foo.bar.com b002
172.16.0.3 b003.qux.foo.bar.com b003

Then just set UID_DOMAIN and FILESYSTEM_DOMAIN to qux.foo.bar.com. If this
is set up correctly, you won't need the DEFAULT_DOMAIN_NAME config file
entry.

> 3) All nodes have two interfaces one (on the 172.16.0.0 network) is
>    used for MPI communication, the second (on the 172.17.0.0 network)
>    is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
>    of the condor_config.local files.
>    Does it create a problem with HTCondor that NFS traffic (including
>    exporting of the HTCondor home directory and the HTCondor binaries and
>    libraries) is actually on a different interface?

Consensus amongst the development team is no, it shouldn't create any
problems, but neither has anyone run across this before. If you do run
into problems, please let us know.

--condor-admin

===========================================================================
Date mail was appended: Tue Jul 16 16:45:27 2002 (1026855928)

Date: Tue, 23 Jul 2002 15:00:51 -0700
From: Martin Siegert
To: condor-admin response tracking system
Subject: Re: [condor-admin #4081] condor on a beowulf

Hello again,

On Tue, Jul 16, 2002 at 04:45:27PM -0600, condor-admin response tracking system wrote:
> Hello,
>
> > I am in the process of setting up HTCondor as a batch queueing/scheduling
> > system for our Beowulf cluster. But before I can get started I already
> > ran into a few problems:
> >
> > 1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
> >    I was expecting that this would contain shared libraries (lib*.so).
> >    However, the release.tar contains only what looks like static
> >    libraries (lib*.a). Am I missing something?
>
> No. We statically link in all of our own libraries, hence the size of the
> executables.

Doesn't that mean that every time a job is checkpointed all the HTCondor
libraries are dumped into the checkpoint directory for every and each job
over and over again leading to an unnecessary high demand for disk space?
I would prefer adding /usr/local/condor/lib to /etc/ld.so.conf (just a
thought - I have not actually estimated the typical disk space needs on
my cluster in the checkpoint directory).

> > 2) I am installing HTCondor on the master node of the cluster and want to
> >    use it on the private network 172.16.0.0. The hostname structure is
> >
> >    172.16.0.1 b001 (master node)
> >    172.16.0.2 b002
> >    172.16.0.3 b003
> >    ...
> >
> >    Following section 3.11.9 of the manual I set
> >
> >    CONDOR_HOST = b001
> >
> >    but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN?
>
> You need to "make up" a domain for your machines, like so:
>
>
> 172.16.0.1 b001.qux.foo.bar.com b001 (master node)
> 172.16.0.2 b002.qux.foo.bar.com b002
> 172.16.0.3 b003.qux.foo.bar.com b003
>
> Then just set UID_DOMAIN and FILESYSTEM_DOMAIN to qux.foo.bar.com. If this
> is set up correctly, you won't need the DEFAULT_DOMAIN_NAME config file
> entry.

Hmm. That isn't really an option. Too many scripts would break that rely
on the short hostnames. Is there a way to change the FULL_HOSTNAME
macro (I guess that that is determining how hostnames get interpreted)?
As far as I understand I can set FILESYSTEM_DOMAIN to any string and
as long as it is the same on all hosts it will work.
I could set UID_DOMAIN to *. Then I could firewall all HTCondor ports on the
master node so that only machines on the private network can talk to the
HTCondor daemons. This seems to be desirable anyway. Which are the HTCondor
ports that I would have to block in that case? Anything else that I must
consider (with respect to security), if I set UID_DOMAIN = * ?

> > 3) All nodes have two interfaces one (on the 172.16.0.0 network) is
> >    used for MPI communication, the second (on the 172.17.0.0 network)
> >    is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
> >    of the condor_config.local files.
> >    Does it create a problem with HTCondor that NFS traffic (including
> >    exporting of the HTCondor home directory and the HTCondor binaries and
> >    libraries) is actually on a different interface?
>
> Consensus amongst the development team is no, it shouldn't create any
> problems, but neither has anyone run across this before. If you do run
> into problems, please let us know.
>
> --condor-admin

I'll definitely let you know.

But for now I ran into another problem:

SMP configuration.
All of the machines in my cluster are identical dual processor boxes.
I understand that by default HTCondor would split those boxes up into
two virtual machines with equal amount of memory. This is not really
what I would like to do. I'd rather make all of memory available to
the first job and then make [total_memory - (memory taken by first job)]
available to the second machine. Is that possible?

Also: can you give me some hints what would be reasonable policies for
a cluster?
This is what I'd like to do: the cluster has 192 cpus. I'd like to
split it into two parts: 96 CPUs dedicated to MPI (and possibly PVM)
jobs and 96 CPUs for everything else.
Right now I do not have enough MPI jobs to fill the MPI processors,
but I have (at least sometimes) more than 192 serial jobs.
I'd like to configure HTCondor such that a serial job goes preferably
on those 96 processors that are not dedicated to MPI jobs.
If those are already busy, serial jobs go onto the MPI processors as
well. When a MPI job claims those processor the serial jobs vacate
immediately and get back to the top of the serial queue.
Preferably the user priority is determined by the number of
processors jobs of the user currently occupy, i.e., the "history"
should not play a role.

Quite honestly: I am kind of lost after having read through various
chapters of the manual.
I do not have a "Owner" - thus I understand that I somehow should
guarantee that Owner evaluates to UNDEFINED always. Is that the
default?
Also I guess that I should set KeyboardIdle to True and KeyboardBusy
and ConsoleBusy to False, correct?
What else should I set?

According to chapter 3.11.11 half of the machines will be defined as
dedicated resources in their local config files. In order to let serial
jobs onto those processor do I just set "START = True" as mentioned under
3.11.11.5?
Or can I actually use "START = True" everywhere, thus just define
it in the global config file?
Also, is there a way of forcing users to submit their jobs to HTCondor
instead of starting them directly in the background? I.e., can
HTCondor stop jobs that are not started via HTCondor?

Any pointers are appreciated.

Thanks a lot for your help!

Cheers,
Martin

Assigned to wright by ned

===========================================================================
Date of actions: Mon Jul 29 13:47:30 2002 (1027968451)

To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 16:39:28 -0500
From: Derek Wright

> Doesn't that mean that every time a job is checkpointed all the HTCondor
> libraries are dumped into the checkpoint directory for every and each job
> over and over again leading to an unnecessary high demand for disk space?

basically, yes. however, there is state in the HTCondor libraries linked
in with your code, specific to each job, that can't be shared across the
different jobs.

> I would prefer adding /usr/local/condor/lib to /etc/ld.so.conf (just a
> thought - I have not actually estimated the typical disk space needs on
> my cluster in the checkpoint directory).

if we put in a lot of effort, we might be able to make a shareable section
of the HTCondor libraries and put that in a dynamic library. however, the
problems associated with checkpointing dynamically linked jobs
(particularly on linux) were so awful that we decided it wasn't worth
our effort to continue to support it. 80 gig drives are so cheap now,
we're not losing any sleep over the slightly inefficient use of space...

> Hmm. That isn't really an option. Too many scripts would break that rely
> on the short hostnames. Is there a way to change the FULL_HOSTNAME
> macro (I guess that that is determining how hostnames get interpreted)?

sort of. that's not really what you want, though. as luck would have
it, there's active work on the HTCondor team to support pools without
fully qualified hostnames.
however, i'm not involved with that, so i put that question of yours
into a new message in our tracking system. i'll assign it to someone
else, and let them reply to all of the domain-related stuff.

> But for now I ran into another problem:

for future reference, it's a lot easier for us if you just send a new
message to condor-admin with completely new questions. most of the
time, different people are the best to answer any given question. if
there's a bunch of background information in common between the two
problems, you can always reference the tracking number of your old
ticket (in this case [condor-admin #4081]) in the new one, and any of
us can read the other ticket if we need to...

i'll reply to your SMP and MPI questions in a little while.

-derek

===========================================================================
Date mail was appended: Tue Jul 30 16:39:28 2002 (1028065169)

To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 19:10:49 -0500
From: Derek Wright

> SMP configuration.
> All of the machines in my cluster are identical dual processor boxes.
> I understand that by default HTCondor would split those boxes up into
> two virtual machines with equal amount of memory. This is not really
> what I would like to do. I'd rather make all of memory available to
> the first job and then make [total_memory - (memory taken by first job)]
> available to the second machine. Is that possible?

no. i agree with you, that's how i think it should work, too.
however, when i was implementing the smp support in HTCondor, i was told
it had to work the way it does, so that's how it is.

here's the deal: you can partition the memory however you want between
the two cpus (50/50, 73/27, whatever you want), but you have to do it
ahead of time. if there's no job running on the machine, you can
re-partition the memory if you want. but, there's no way, while a job
is running, to change it.
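
As an illustration of the static split described above, here is a minimal Python sketch. It is not HTCondor code and not HTCondor configuration: the helper names `partition_memory` and `can_repartition`, and the 2048 MB machine size, are assumptions made up for this example. It only models the two rules stated in the reply: the split is chosen ahead of time as percentages, and it may only be changed while no job is running.

```python
def partition_memory(total_mb, shares):
    """Split total_mb between virtual machines by percentage shares.

    Hypothetical helper for illustration; HTCondor does this through its
    own configuration, not through a function like this one.
    """
    if sum(shares) != 100:
        raise ValueError("shares must add up to 100 percent")
    sizes = [total_mb * share // 100 for share in shares]
    # Integer division can lose a megabyte or two; give it to the first VM.
    sizes[0] += total_mb - sum(sizes)
    return sizes


def can_repartition(running_jobs):
    """The split may only be changed while no job runs on the machine."""
    return running_jobs == 0


# Assumed example machine: a dual-processor node with 2048 MB of RAM.
print(partition_memory(2048, [50, 50]))  # [1024, 1024]
print(partition_memory(2048, [73, 27]))  # [1496, 552]
print(can_repartition(running_jobs=1))   # False
```

The point of the second function is the constraint Martin ran into: a split cannot follow the first job's actual memory use, because it cannot change once that job has started.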
the "party line" is that if you really need to dynamically partition
the memory on a per-job basis, you should write your own script/daemon
to monitor your job queue, and decide when and how to repartition the
memory on your machines. i think this is highly unsatisfactory, but
that's what i'm supposed to tell you. i wish it could be otherwise,
but that's just how it is. i'm sorry.

> Also: can you give me some hints what would be reasonable policies for
> a cluster?

sure.

> This is what I'd like to do: the cluster has 192 cpus. I'd like to
> split it into two parts: 96 CPUs dedicated to MPI (and possibly PVM)
> jobs and 96 CPUs for everything else.

sounds good (assuming you never need more than 96 cpus for MPI)

> Right now I do not have enough MPI jobs to fill the MPI processors,
> but I have (at least sometimes) more than 192 serial jobs.

no problem. this mix of dedicated parallel jobs and "opportunistic"
serial jobs is exactly the kind of environment HTCondor is set up to
handle.

> I'd like to configure HTCondor such that a serial job goes preferably
> on those 96 processors that are not dedicated to MPI jobs.

you would do this with the "Rank" expression in the job's submit file.
you can do this one of two ways:

1) educate your users to put it in their own Rank expression
2) have condor_submit automatically append it to their Rank for them

either way, you want the serial jobs to prefer to run on machines with
the "DedicatedScheduler" attribute undefined (therefore, unable to run
MPI jobs). like this:

Rank = DedicatedScheduler =?= UNDEFINED

i don't know if you're already using the job rank for anything. if
so, you may want to give this preference more weight. for example,
you might want to run on machines with more memory, but you'd much
rather run on a small-memory non-mpi machine than a high memory mpi
machine.
you'd do that like this:

Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)

assuming you don't have any machines with 10 gigs of RAM in them :), if
DedicatedScheduler is undefined, that whole side of it will evaluate to
(1 * 10000) and if you add that to, say 512 megs of ram on a "small
memory" non-mpi machine, you'll always get more than the 2048 the rank
would evaluate to for a high memory mpi node. make sense?

it's probably best to just have submit add this for you, so you don't
have to worry about users forgetting. the "append_rank" expression in
your config file will do this for you, like this:

APPEND_RANK = (DedicatedScheduler =?= UNDEFINED) * 10000

condor_submit is smart enough that if you specify something to append,
but the user doesn't define anything, the expression to append is just
used as the rank. if the user defines anything, it's put in ()'s, and
your expression is appended like this:

Rank = (their expression) + (your expression)

or, in this case:

Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)

> If those are already busy, serial jobs go onto the MPI processors as
> well.

job rank is just a preference. it's not a requirement (since you put
this restriction in "rank", not "requirements"). so, if the job
doesn't have anywhere to go that it really prefers, it'll run anywhere
that satisfies its requirements. so, you don't have to do anything
special to get this behavior.

> When a MPI job claims those processor the serial jobs vacate
> immediately

that's exactly what happens when the machine's Rank expression is used
to specify what kinds of jobs a machine prefers to run. that's why
you're supposed to put:

RANK = Scheduler =?= $(DedicatedScheduler)

in your config file for resources that are going to be dedicated.
this way, if an MPI job comes along, dedicated resources will always
evict non-mpi jobs to run the mpi job instead.

> and get back to the top of the serial queue.

yeah, basically.
it's not really separate queues, but that doesn't matter. the behavior
is what you expect...

> Preferably the user priority is determined by the number of
> processors jobs of the user currently occupy, i.e., the "history"
> should not play a role.

that's a different issue. this is the responsibility of the HTCondor
"accountant", which lives inside the condor_negotiator daemon. the
knob you want to turn is called "PRIORITY_HALFLIFE". think of your
user priority as a radioactive substance. :) consider a priority that
exactly matches your current resource usage the "stable state", and a
priority "contaminated" with past usage "radioactive." if it's got a
long halflife, it takes a long time for your priority to decay back to
"normal". if the halflife is very short, it'll decay very quickly,
and will remain very close to your current usage.

so, just set PRIORITY_HALFLIFE to a small floating point value (like
0.0001), and your user priority should always match your current
usage. if you're not using any resources, your priority will go back
to the baseline value instantly.

> Quite honestly: I am kind of lost after having read through various
> chapters of the manual.

yeah, HTCondor is incredibly flexible, therefore, incredibly complicated
to configure and document. i'm sorry. we get this stuff wrong in our
own pool on occasion, even with the developers editing the config
files. :)

that said, specific feedback on things in the manual that are
particularly confusing, or suggestions for improvement are always
welcome. (again, just send a new message to condor-admin, so the
person who does the editing of the manual can reply to it).

> I do not have a "Owner" - thus I understand that I somehow should
> guarantee that Owner evaluates to UNDEFINED always. Is that the
> default?

you don't really need to worry about that (read on)

> Also I guess that I should set KeyboardIdle to True and KeyboardBusy
> and ConsoleBusy to False, correct?
KeyboardIdle and ConsoleIdle are computed for you by HTCondor. they're
just a measure of how long it's been since someone logged into a tty,
or touched the physical console. you can't set them to anything.
however, all that really matters are the "policy expressions" (start,
suspend, preempt, etc), and if you don't refer to these attributes in
any of those expressions, if their values change, it won't impact the
policy on the machine.

> What else should I set?
>
> According to chapter 3.11.11 half of the machines will be defined as
> dedicated resources in their local config files. In order to let serial
> jobs onto those processor do I just set "START = True" as mentioned under
> 3.11.11.5?

yes. if you want them to always run jobs, you want:

START = True

if you want them to also be able to run MPI jobs, you want to set:

DedicatedScheduler = "DedicatedScheduler@b001"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)

to tell them about MPI stuff.

> Or can I actually use "START = True" everywhere, thus just define
> it in the global config file?

it depends. if you want all of your nodes to run jobs all the time
(assuming they're not already running a job), then, yes, you can say
"START = True" in your global config file and that'll be the behavior
on all the machines in the pool.

> Also, is there a way of forcing users to submit their jobs to HTCondor
> instead of starting them directly in the background? I.e., can
> HTCondor stop jobs that are not started via HTCondor?

sort of, but it's quite convoluted at this point. your best bet for
now is to just have a cron job run every minute or so, check for
processes on your system that are using lots of cpu that aren't
children of the condor_master, and take appropriate action (kill
-TERM, kill -KILL, whatever).

i hope this helps. if you have further questions about any of this,
reply to this message.
if you have other questions or comments about HTCondor, just send a new
note to condor-admin.

thanks,
-derek

===========================================================================
Date mail was appended: Tue Jul 30 19:10:50 2002 (1028074250)

From RUST Tue, 30 Jul 2002 19:10:50 -0600 (CST)
Subject: Actions

Ticket resolved by wright

===========================================================================
Date of actions: Tue Jul 30 19:10:50 2002 (1028074251)
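
The arithmetic behind the weighted job Rank in Derek's reply can be checked with a short Python sketch. This is only a model of the expression `(Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)`, not the ClassAd evaluator itself. It assumes an undefined `DedicatedScheduler` attribute can be represented by `None`, and the two machine names are made up for the example. The memory figures (512 and 2048 megs) are the ones used in the reply.

```python
def job_rank(memory_mb, dedicated_scheduler=None):
    """Model of Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000).

    None stands in for an UNDEFINED DedicatedScheduler attribute, i.e. a
    machine that is not set up as a dedicated (MPI) resource.
    """
    is_non_mpi = dedicated_scheduler is None
    return memory_mb + int(is_non_mpi) * 10000


# Hypothetical machines, using the memory sizes from the reply.
machines = [
    {"name": "big-mpi", "memory_mb": 2048,
     "dedicated_scheduler": "DedicatedScheduler@b001"},
    {"name": "small-serial", "memory_mb": 512,
     "dedicated_scheduler": None},
]

# A serial job prefers the machine with the highest rank.
ranked = sorted(
    machines,
    key=lambda m: job_rank(m["memory_mb"], m["dedicated_scheduler"]),
    reverse=True,
)
for m in ranked:
    print(f"{m['name']}: {job_rank(m['memory_mb'], m['dedicated_scheduler'])}")
# small-serial: 10512
# big-mpi: 2048
```

The small non-MPI machine wins with 10512 against 2048, as the reply says. The same arithmetic shows why the 10000 weight only works while no MPI machine has 10000 megs more memory than a non-MPI one: beyond that gap the Memory term would outweigh the bonus.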