How to allow some jobs to claim the whole machine instead of one slot

Known to work with Condor version: 7.4

The simplest way to achieve this is to set NUM_CPUS=1 so that each machine advertises a single slot. However, this prevents you from supporting a mix of single-core and whole-machine jobs. The following example supports both.

First, have whole-machine jobs advertise themselves as such by adding something like the following to the submit file:

+RequiresWholeMachine = True

If the job must have the whole machine, it should also require the whole-machine slot. Add the following to the submit file, merging it with whatever other requirements the job has.

Requirements = IsWholeMachineSlot
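
Putting the two submit-file lines together, a minimal sketch of a complete submit file might look like the following. The executable name, the file names, and the extra OpSys clause are hypothetical placeholders chosen only to show how the whole-machine requirement can be merged with other requirements.

```
# Sketch of a whole-machine job submit file (names are placeholders)
universe   = vanilla
executable = my_parallel_app
output     = job.out
error      = job.err
log        = job.log

# Advertise that this job wants the whole machine
+RequiresWholeMachine = True

# Require the whole-machine slot, merged with another (assumed) requirement
Requirements = IsWholeMachineSlot && (OpSys == "LINUX")

queue
```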

Then put the following in your Condor configuration file. Make sure it either comes after the attributes to which this policy appends (such as START) or that you merge the definitions together.


# we will double-allocate resources to overlapping slots
NUM_CPUS = $(DETECTED_CORES)*2
MEMORY = $(DETECTED_MEMORY)*2

# single-core slots get 1 core each
SLOT_TYPE_1 = cpus=1
NUM_SLOTS_TYPE_1 = $(DETECTED_CORES)

# whole-machine slot gets as many cores and RAM as the machine has
SLOT_TYPE_2 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY)
NUM_SLOTS_TYPE_2 = 1

# Macro specifying the slot id of the whole-machine slot
# Example: on an 8-core machine, the whole-machine slot is 9.
WHOLE_MACHINE_SLOT = ($(DETECTED_CORES)+1)

# ClassAd attribute that is True/False depending on whether this slot is
# the whole-machine slot
IsWholeMachineSlot = SlotID == $(WHOLE_MACHINE_SLOT)
STARTD_EXPRS = $(STARTD_EXPRS) IsWholeMachineSlot

# advertise state of each slot as SlotX_State in ClassAds of all other slots
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) State

# Macro for referencing state of the whole-machine slot.
# Relies on eval(), which was added in Condor 7.3.2.
WHOLE_MACHINE_SLOT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_State"))

# Macro that is true if any single-core slots are claimed.
# The sum below ranges over all slots, so it can include the
# whole-machine slot itself; comparing against the whole-machine
# slot's own claim discounts it.
# WARNING: THERE MUST BE AN ENTRY FOR ALL SINGLE-CORE SLOTS
# IN THE EXPRESSION BELOW.  If you have more slots, you must
# extend this expression to cover them.  If you have fewer
# slots, extra entries are harmless.
SINGLE_CORE_SLOTS_CLAIMED = \
  ($(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed") < \
  (Slot1_State =?= "Claimed") + \
  (Slot2_State =?= "Claimed") + \
  (Slot3_State =?= "Claimed") + \
  (Slot4_State =?= "Claimed") + \
  (Slot5_State =?= "Claimed") + \
  (Slot6_State =?= "Claimed") + \
  (Slot7_State =?= "Claimed") + \
  (Slot8_State =?= "Claimed") + \
  (Slot9_State =?= "Claimed") + \
  (Slot10_State =?= "Claimed") + \
  (Slot11_State =?= "Claimed") + \
  (Slot12_State =?= "Claimed") + \
  (Slot13_State =?= "Claimed") + \
  (Slot14_State =?= "Claimed") + \
  (Slot15_State =?= "Claimed") + \
  (Slot16_State =?= "Claimed")

# Single-core jobs must run on single-core slots
START_SINGLE_CORE_JOB = \
  TARGET.RequiresWholeMachine =!= True && MY.IsWholeMachineSlot == False && \
  $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"

# Whole-machine jobs must run on the whole-machine slot
START_WHOLE_MACHINE_JOB = \
  TARGET.RequiresWholeMachine =?= True && MY.IsWholeMachineSlot

START = ($(START)) && ( \
  ($(START_SINGLE_CORE_JOB)) || \
  ($(START_WHOLE_MACHINE_JOB)) )

# Suspend the whole-machine job until single-core jobs finish.
SUSPEND = ($(SUSPEND)) || ( \
  MY.IsWholeMachineSlot && ($(SINGLE_CORE_SLOTS_CLAIMED)) )

CONTINUE = ( ($(SUSPEND)) =!= True )

WANT_SUSPEND = ($(WANT_SUSPEND)) || ($(SUSPEND))
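
To make the macro arithmetic concrete, here is how the settings above would resolve on a hypothetical 8-core machine with 16384 MB of RAM. The core count and memory size are assumptions chosen for illustration, and the values are worked out by hand rather than taken from Condor output.

```
# Assumed machine: DETECTED_CORES = 8, DETECTED_MEMORY = 16384
#
# NUM_CPUS                 -> 8*2 = 16  (8 for slots 1-8, 8 for slot 9)
# MEMORY                   -> 16384*2 = 32768
# NUM_SLOTS_TYPE_1         -> 8         (slots 1 through 8, one core each)
# SLOT_TYPE_2              -> cpus=8, mem=16384   (slot 9)
# WHOLE_MACHINE_SLOT       -> (8+1), i.e. slot 9
# WHOLE_MACHINE_SLOT_STATE -> eval(strcat("Slot",(8+1),"_State")),
#                             i.e. the value of Slot9_State
```

On this machine, only the Slot1_State through Slot9_State entries in SINGLE_CORE_SLOTS_CLAIMED refer to real slots; the remaining entries refer to attributes that do not exist and so never count as claimed.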

Instead of suspending the whole-machine job until single-core jobs finish, we could suspend single-core jobs while whole-machine jobs run. Replacing the SUSPEND expression above with the following achieves this policy. Beware that single-core jobs could be suspended indefinitely if there is a steady supply of whole-machine jobs.

# Suspend single-core jobs while the whole-machine job runs
SUSPEND = ($(SUSPEND)) || ( \
  MY.IsWholeMachineSlot =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

Accounting

The above policies rely on job suspension. Should the jobs be "charged" for the time they spend in a suspended state? This affects the user's fair-share priority and the accumulated number of hours reported by condor_userprio. As of Condor 7.4, the default behavior is to charge jobs for the time they spend in a suspended state. There is a configuration variable, NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES, that can be used to get the opposite behavior.
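
As a sketch, turning off charging for suspended time might look like the following in the configuration read by the negotiator. This assumes the variable takes a boolean value; check the manual for your Condor version before relying on it.

```
# Do not count suspended resources against a user's usage
NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES = True
```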