How to allow some jobs to claim the whole machine instead of one slot

Known to work with Condor version: 7.4

The simplest way to achieve this is the following configuration:

SLOT_TYPE_1 = cpus=100%
NUM_SLOTS_TYPE_1 = 1

With that configuration, each machine advertises just a single slot. However, this prevents you from supporting a mix of single-core and whole-machine jobs. The following example supports both.

First, you would have whole-machine jobs advertise themselves as such with something like the following in the submit file:

+RequiresWholeMachine = True

If running on a whole machine is a hard requirement, the job should also state that in its requirements. The following submit-file expression should be merged with whatever other requirements the job may have:

Requirements = CAN_RUN_WHOLE_MACHINE
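
For example, a complete submit file for a hypothetical whole-machine job might look like the following; the executable and the OpSys clause are placeholders for whatever your job actually needs:

universe     = vanilla
executable   = my_whole_machine_job.sh
output       = job.out
error        = job.err
log          = job.log
+RequiresWholeMachine = True
# Merge the whole-machine requirement with the job's other requirements
Requirements = CAN_RUN_WHOLE_MACHINE && (OpSys == "LINUX")
queue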

Then put the following in your Condor configuration file. Make sure it either comes after any existing definitions of the expressions this policy appends to (such as START), or merge the definitions together.


# we will double-allocate resources to overlapping slots
NUM_CPUS = $(DETECTED_CORES)*2
MEMORY = $(DETECTED_MEMORY)*2

# single-core slots get 1 core each
SLOT_TYPE_1 = cpus=1
NUM_SLOTS_TYPE_1 = $(DETECTED_CORES)

# whole-machine slot gets as many cores and RAM as the machine has
SLOT_TYPE_2 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY)
NUM_SLOTS_TYPE_2 = 1

# Macro specifying the slot id of the whole-machine slot
# Example: on an 8-core machine, the whole-machine slot is 9.
WHOLE_MACHINE_SLOT = ($(DETECTED_CORES)+1)

# ClassAd attribute that is True/False depending on whether this slot is
# the whole-machine slot
CAN_RUN_WHOLE_MACHINE = SlotID == $(WHOLE_MACHINE_SLOT)
STARTD_EXPRS = $(STARTD_EXPRS) CAN_RUN_WHOLE_MACHINE

# advertise state of each slot as SlotX_State in ClassAds of all other slots
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) State

# Macro for referencing state of the whole-machine slot.
# Relies on eval(), which was added in Condor 7.3.2.
WHOLE_MACHINE_SLOT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_State"))

# Macro that is true if any single-core slots are claimed
# WARNING: THERE MUST BE AN ENTRY FOR ALL SINGLE-CORE SLOTS
# IN THE EXPRESSION BELOW.  If you have more slots, you must
# extend this expression to cover them.  If you have fewer
# slots, extra entries are harmless.
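#
# Note: the whole-machine slot itself may appear in the list below
# (e.g. Slot9 on an 8-core machine); comparing the sum against the
# whole-machine slot's own claimed state, rather than against 0,
# cancels out its contribution.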
SINGLE_CORE_SLOTS_CLAIMED = \
  ($(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed") < \
  (Slot1_State =?= "Claimed") + \
  (Slot2_State =?= "Claimed") + \
  (Slot3_State =?= "Claimed") + \
  (Slot4_State =?= "Claimed") + \
  (Slot5_State =?= "Claimed") + \
  (Slot6_State =?= "Claimed") + \
  (Slot7_State =?= "Claimed") + \
  (Slot8_State =?= "Claimed") + \
  (Slot9_State =?= "Claimed") + \
  (Slot10_State =?= "Claimed") + \
  (Slot11_State =?= "Claimed") + \
  (Slot12_State =?= "Claimed") + \
  (Slot13_State =?= "Claimed") + \
  (Slot14_State =?= "Claimed") + \
  (Slot15_State =?= "Claimed") + \
  (Slot16_State =?= "Claimed")

# Single-core jobs must run on single-core slots
START_SINGLE_CORE_JOB = \
  TARGET.RequiresWholeMachine =!= True && MY.CAN_RUN_WHOLE_MACHINE == False && \
  $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"

# Whole-machine jobs must run on the whole-machine slot
START_WHOLE_MACHINE_JOB = \
  TARGET.RequiresWholeMachine =?= True && MY.CAN_RUN_WHOLE_MACHINE

START = ($(START)) && ( \
  ($(START_SINGLE_CORE_JOB)) || \
  ($(START_WHOLE_MACHINE_JOB)) )

# Suspend the whole-machine job until single-core jobs finish.
SUSPEND = ($(SUSPEND)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE && ($(SINGLE_CORE_SLOTS_CLAIMED)) )

CONTINUE = ( $(SUSPEND) =!= True )

WANT_SUSPEND = ($(WANT_SUSPEND)) || ($(SUSPEND))
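
After adding this configuration, one quick sanity check is to reconfigure the startd and confirm that the key settings are defined as expected. A minimal sketch, run on the execute node (the values printed will reflect your local configuration):

condor_reconfig
condor_config_val NUM_SLOTS_TYPE_1 NUM_SLOTS_TYPE_2 WHOLE_MACHINE_SLOT START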

Instead of suspending the whole-machine job until the single-core jobs finish, we could suspend the single-core jobs while the whole-machine job runs. Replacing the SUSPEND expression above with the following achieves that policy. Beware that this could leave single-core jobs suspended indefinitely if there is a steady supply of whole-machine jobs.

# Suspend single-core jobs while the whole-machine job runs
SUSPEND = ($(SUSPEND)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

Another possible policy is to suspend the whole-machine job while single-core jobs are running, but to kick the single-core jobs off the machine if they take too long to finish. Starting from the original example policy, this can be achieved by adding the following to the configuration:

PREEMPT = ($(PREEMPT)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

# When a job is preempted, let it run for up to a day before killing it
MaxJobRetirementTime = 24*3600

Accounting and Monitoring

The above policies rely on job suspension. Should jobs be "charged" for the time they spend in a suspended state? This affects the user's fair-share priority and the accumulated number of hours reported by condor_userprio. As of Condor 7.4, the default behavior is to charge jobs for the time they spend suspended. The configuration variable NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES can be used to get the opposite behavior.
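
For example, to stop charging jobs for suspended time, something like the following could go in the configuration read by the negotiator on the central manager; this is a sketch of the opposite-of-default behavior described above:

# Do not charge jobs for resources while they are suspended
NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES = True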

Should the whole-machine slot charge more than the single-core slots? The policy for this is determined by SlotWeight. By default, this is equal to the number of cores associated with the slot, so usage reported in condor_userprio will count the whole-machine slot on an 8-core machine as 8 times the usage reported for a single-core slot.
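
If you would rather not charge the whole-machine slot more heavily, the weight can be overridden. A minimal sketch, assuming you want every slot to count equally regardless of how many cores it has:

# Count every slot equally (the default weight is the slot's number of cores)
SLOT_WEIGHT = 1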

There are a few things to keep in mind when interpreting the output of condor_status. The TotalCpus and TotalMemory attributes are double what the machine actually has, because we are double-counting these resources across the single-core and whole-machine slots.
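
One way to see this on a particular machine is to dump the full slot ads and pick out the totals; execute.node is the example hostname used in the listings below:

condor_status -long execute.node | egrep '^Total(Cpus|Memory) ' | sort -u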

When looking at condor_status, the extra slot representing the whole machine is visible. Notice that it appears in the Owner state, because the START expression we have configured is false unless presented with a whole-machine job. We could modify the IsOwner expression to make the slot appear as Unclaimed rather than Owner, but we find that Owner is a nice reminder that this slot is special and is not available to run "normal" jobs. In fact, transitions between the Owner and Unclaimed states have an additional important side-effect: they cause the machine to send an updated ad to the collector whenever the machine's willingness to run a job changes. Without this, it can take as long as UPDATE_INTERVAL for the collector to be updated.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@execute.node LINUX      X86_64 Unclaimed Idle     0.750  2006  0+00:00:04
slot2@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:05
slot3@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:06
slot4@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:07
slot5@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:08
slot6@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:09
slot7@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:10
slot8@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:03
slot9@execute.node LINUX      X86_64 Owner     Idle     0.000  16054  0+00:00:08

To filter out the whole-machine slot, one could use a constraint such as the following:

condor_status -constraint 'CAN_RUN_WHOLE_MACHINE =!= True'
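
Conversely, to see only the whole-machine slots:

condor_status -constraint 'CAN_RUN_WHOLE_MACHINE =?= True'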

When the whole-machine slot is claimed, the other slots will appear in the Owner state, because they are configured to not start any new jobs while the whole-machine slot is claimed. Again, we could make them appear as Unclaimed by changing the IsOwner expression, but the Owner state serves as a useful reminder that these slots are reserved during this time, and it has the side-effect of forcing timely updates of the slot ad to the collector.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@execute.node LINUX      X86_64 Owner     Idle     0.870  2006  0+00:00:04
slot2@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:05
slot3@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:06
slot4@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:07
slot5@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:08
slot6@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:09
slot7@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:10
slot8@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:03
slot9@execute.node LINUX      X86_64 Claimed   Busy     0.000  16054  0+00:00:04

Since the whole-machine policy depends on STARTD_SLOT_EXPRS, it can take a few iterations of ClassAd updates for the state of the slots to converge. For this reason, the information visible in condor_status may lag behind the true state of the slots when conditions have recently changed. It is therefore useful to pass the -direct option to condor_status when verifying that the whole-machine policy is working correctly. As noted above, with IsOwner left at its default so that slots enter the Owner state when they are unwilling to run new jobs, the information published in the collector is updated much more quickly than if IsOwner were set to false. It is therefore highly recommended that IsOwner not be set to false.
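
For example, to query the startd on the execute node directly rather than relying on the collector's possibly stale copy of the ads (execute.node is the example hostname from the listings above):

condor_status -direct execute.node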