Hi there,

I am in the process of setting up HTCondor as a batch queueing/scheduling
system for our Beowulf cluster. But before I can get started I already
ran into a few problems:

[...]

However, the release.tar contains only what looks like static
libraries (lib*.a). Am I missing something?

2) I am installing HTCondor on the master node of the cluster and want to
   use it on the private network 172.16.0.0. The hostname structure is

   172.16.0.1 b001 (master node)

[...]

   used for MPI communication, the second (on the 172.17.0.0 network)
   is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
   of the condor_config.local files.
   Does it create a problem with HTCondor that NFS traffic (including
   exporting of the HTCondor home directory and the HTCondor binaries and
   libraries) is actually on a different interface?

Thanks a lot for your help in advance!
Also, thanks for making HTCondor available.

Regards,
Martin

[...]

Hello,

> I am in the process of setting up HTCondor as a batch queueing/scheduling
> system for our Beowulf cluster. But before I can get started I already
> ran into a few problems:
>
[...]

No. We statically link in all of our own libraries, hence the size of the
executables.

> 2) I am installing HTCondor on the master node of the cluster and want to
> use it on the private network 172.16.0.0.
> The hostname structure is
>
> 172.16.0.1 b001 (master node)

[...]

> used for MPI communication, the second (on the 172.17.0.0 network)
> is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
> of the condor_config.local files.
> Does it create a problem with HTCondor that NFS traffic (including
> exporting of the HTCondor home directory and the HTCondor binaries and
> libraries) is actually on a different interface?

Consensus amongst the development team is no, it shouldn't create any
[...]

> Hello,
>
> > I am in the process of setting up HTCondor as a batch queueing/scheduling
> > system for our Beowulf cluster. But before I can get started I already
> > ran into a few problems:
> >
[...]

> No. We statically link in all of our own libraries, hence the size of the
> executables.

Doesn't that mean that every time a job is checkpointed all the HTCondor
libraries are dumped into the checkpoint directory for each and every
job, over and over again, leading to an unnecessarily high demand for
disk space? I would prefer adding /usr/local/condor/lib to
/etc/ld.so.conf (just a thought - I have not actually estimated the
typical disk space needs on my cluster in the checkpoint directory).

> > 2) I am installing HTCondor on the master node of the cluster and want to
> > use it on the private network 172.16.0.0. The hostname structure is
> >
> > 172.16.0.1 b001 (master node)

[...]

As far as I understand I can set FILESYSTEM_DOMAIN to any string and as
long as it is the same on all hosts it will work.
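As a sketch of that idea (the domain value below is an arbitrary illustrative label, not an HTCondor default), each node's config could contain:

```
## Any string works here, as long as every host that shares the
## NFS-exported filesystem uses the identical value (illustrative):
FILESYSTEM_DOMAIN = beowulf.cluster
```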
I could set UID_DOMAIN to *. Then I could firewall all HTCondor ports
on the master node so that only machines on the private network can
talk to the HTCondor daemons. This seems to be desirable anyway.
Which HTCondor ports would I have to block in that case?
Anything else that I must consider (with respect to security), if I set
UID_DOMAIN = *?

> > used for MPI communication, the second (on the 172.17.0.0 network)
> > is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
> > of the condor_config.local files.
> > Does it create a problem with HTCondor that NFS traffic (including
> > exporting of the HTCondor home directory and the HTCondor binaries and
> > libraries) is actually on a different interface?
>
> Consensus amongst the development team is no, it shouldn't create any
[...]

I'll definitely let you know. But for now I ran into another problem:
SMP configuration.
All of the machines in my cluster are identical dual-processor boxes.
I understand that by default HTCondor would split those boxes up into
two virtual machines with an equal amount of memory. This is not really
what I would like to do. I'd rather make all of the memory available to
the first job and then make [total_memory - (memory taken by first job)]
available to the second machine. Is that possible?

[...]
it into two parts: 96 CPUs dedicated to MPI (and possibly PVM) jobs and
96 CPUs for everything else. Right now I do not have enough MPI jobs to
fill the MPI processors, but I have (at least sometimes) more than 192 serial
jobs. I'd like to configure HTCondor such that a serial job goes preferably
on those 96 processors that are not dedicated to MPI jobs.
If those are already busy, serial jobs go onto the MPI processors as
well. When an MPI job claims those processors, the serial jobs vacate
immediately
[...]
3.11.11.5? Or can I actually use "START = True" everywhere, thus just
define it in the global config file?

Also, is there a way of forcing users to submit their jobs to HTCondor
instead of starting them directly in the background? I.e., can HTCondor
stop jobs that are not started via HTCondor?

Any pointers are appreciated. Thanks a lot for your help!

[...]
To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 16:39:28 -0500
From: Derek Wright

> Doesn't that mean that every time a job is checkpointed all the HTCondor
> libraries are dumped into the checkpoint directory for each and every
> job, over and over again, leading to an unnecessarily high demand for
> disk space?

basically, yes. however, there is state in the HTCondor libraries
linked in with your code, specific to each job, that can't be shared
across the different jobs.

[...]
> my cluster in the checkpoint directory).

if we put in a lot of effort, we might be able to make a shareable
section of the HTCondor libraries and put that in a dynamic library.
however, the problems associated with checkpointing dynamically linked
jobs (particularly on linux) were so awful that we decided it wasn't
worth our effort to continue to support it. 80 gig drives are so
[...]

sort of. that's not really what you want, though.

as luck would have it, there's active work on the HTCondor team to
support pools without fully qualified hostnames. however, i'm not
involved with that, so i put that question of yours into a new message
in our tracking system. i'll assign it to someone else, and let them
[...]

To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 19:10:49 -0500
From: Derek Wright

> SMP configuration.
> All of the machines in my cluster are identical dual-processor boxes.
> I understand that by default HTCondor would split those boxes up into
> two virtual machines with an equal amount of memory. This is not really
> what I would like to do. I'd rather make all of the memory available to
> the first job and then make [total_memory - (memory taken by first job)]
> available to the second machine. Is that possible?

no. i agree with you, that's how i think it should work, too.
however, when i was implementing the smp support in HTCondor, i was told
it had to work the way it does, so that's how it is. here's the deal:
you can partition the memory however you want between the two cpus
(50/50, 73/27, whatever you want), but you have to do it ahead of
[...]

> but I have (at least sometimes) more than 192 serial jobs.

no problem.
this mix of dedicated parallel jobs and "opportunistic"
serial jobs is exactly the kind of environment HTCondor is set up to
handle.

> I'd like to configure HTCondor such that a serial job goes preferably
> on those 96 processors that are not dedicated to MPI jobs.

you would do this with the "Rank" expression in the job's submit
[...]

> processors jobs of the user currently occupy, i.e., the "history"
> should not play a role.

that's a different issue. this is the responsibility of the HTCondor
"accountant", which lives inside the condor_negotiator daemon. the
knob you want to turn is called "PRIORITY_HALFLIFE". think of your
user priority as a radioactive substance. :) consider a priority that
[...]

> Quite honestly: I am kind of lost after having read through various
> chapters of the manual.

yeah, HTCondor is incredibly flexible and, therefore, incredibly
complicated to configure and document. i'm sorry. we get this stuff
wrong in our own pool on occasion, even with the developers editing the
config files. :)

[...]

> Also I guess that I should set KeyboardIdle to True and KeyboardBusy
> and ConsoleBusy to False, correct?

KeyboardIdle and ConsoleIdle are computed for you by HTCondor. they're
just a measure of how long it's been since someone logged into a tty,
or touched the physical console. you can't set them to anything.
however, all that really matters are the "policy expressions" (start,
[...]
"START = True" in your global config file and that'll be the behavior
on all the machines in the pool.
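The always-run behavior described here can be sketched as a global policy block using the standard startd policy expressions (the particular combination of values below is just an illustration of one reasonable setup):

```
## Always-run policy: start jobs unconditionally and never suspend,
## preempt, or kill them because of keyboard/console activity.
START    = True
SUSPEND  = False
CONTINUE = True
PREEMPT  = False
KILL     = False
```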
> Also, is there a way of forcing users to submit their jobs to HTCondor
> instead of starting them directly in the background? I.e., can
> HTCondor stop jobs that are not started via HTCondor?

sort of, but it's quite convoluted at this point. your best bet for
now is to just have a cron job run every minute or so, check for
[...]

i hope this helps. if you have further questions about any of this,
reply to this message. if you have other questions or comments about
HTCondor, just send a new note to condor-admin.

thanks,
-derek
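The cron-job approach above is only hinted at in the reply; as a rough, hypothetical sketch (the function name, input format, and enforcement step are all assumptions, not HTCondor features), the filtering part of such a per-node watchdog might look like:

```shell
#!/bin/sh
# Hypothetical watchdog sketch: print the PID of every process whose
# parent command is not condor_starter, i.e. jobs that were started
# outside HTCondor.  It reads "pid parent_command" lines on stdin so
# the filter itself is easy to test; a real crontab entry would feed
# it a suitably formatted ps(1) listing.

find_rogues() {
    # $1 = pid, $2 = parent command; keep only non-condor_starter children
    awk '$2 != "condor_starter" { print $1 }'
}

# Enforcement (site policy, deliberately commented out):
# find_rogues < "$PS_SNAPSHOT" | xargs -r kill
```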