HTCondorWiki: Startd For Local Universe

Page History

Startd for local universe.

Today Condor supports many universes. However, the "universe" concept contains at least two ideas: A "universe" can related to the type of job, and the support that a certain kind of job needs at runtime, such as "standard universe" (i.e. checkpointing), "java universe", "VM universe", and "vanilla universe". Completely orthogonal to this, the universe can relate to the scheduling of jobs. The "parallel universe", the "grid universe", and the local and scheduler universe are of this kind.

The local universe runs jobs managed by a starter, but this starter is forked by the schedd. Local universe jobs always run on the submit machine, and generally many more are allowed to run than there are cores on the submit machine. Unlike vanilla and standard universe, there is no startd or shadow.

There are about twenty places in the code where we switch on the value of LOCAL_UNIVERSE, and so there are a number of small but surprising differences between the way that local universe jobs and vanilla behave. Lately, the Pegasus folks would like to be able to seemless change job universes between local and vanilla without seeing any of these semantic differences. It would also be useful for the "checkpoint local and run remote" trick that LIGO would do. Other groups have also noticed the difference between local and vanilla. Most recently, CHTC jobs failed in the local universe for issues having to do with the setting of the TEMP environment variable to the sandbox directory.

So, to make local universe jobs have as much as the same semantics as vanilla, we will change local universe jobs to use a startd and shadow. This will increase the overhead of local universe somewhat. We will not change the scheduler universe, so that uses that are very resources constrained can still use that.

The startd the schedd uses for running local universe jobs will be permanently claimed by the schedd, so it will never need the overhead of waiting for a negotiation cycle. As it is today, there will be no user priority based preemption of one local universe job for another.

This startd will be forked and managed by the master, and will report not to the collector, but only to the schedd. It will use partitionable slots, with a large but fixed number of slots, much like MAX_SCHEDULER_JOBS_RUNNING today limits the number of local universe jobs running.

Testing. In addition to ad-hoc testing, most every