How to allow some jobs to claim the whole machine instead of one slot

Known to work with HTCondor version: 7.4

The simplest way to configure a machine so that it can run jobs that require the whole machine is to use slot types. The machine's configuration contains the slot type definition:

SLOT_TYPE_1 = cpus=100%
NUM_SLOTS_TYPE_1 = 1

With that configuration, each machine advertises a single slot. However, this prevents the machine from supporting a mix of single-core and whole-machine jobs. The remainder of this explanation describes a configuration that supports both single-core and whole-machine jobs.

NOTE: The following configuration is the old way of doing things, using static slots. The new way is to use partitionable slots, from which dynamic slots are created to fit each job; see the section on partitionable slots in the HTCondor manual.
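
As a rough sketch (not part of the static-slot recipe below), a partitionable-slot configuration serving the same purpose might look like this:

# one partitionable slot that owns all of the machine's resources;
# dynamic slots are carved out of it to match each job's request
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True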

A whole-machine job needs two things. First, it must advertise itself as a whole-machine job, so that the machine can identify it as such during matchmaking. Second, it must require that it be matched only with a machine that supports whole-machine jobs. For this example, the submit description file contains

+RequiresWholeMachine = True
Requirements          = (Target.CAN_RUN_WHOLE_MACHINE =?= True)
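
For context, a complete submit description file might look like the following sketch; the executable and file names are hypothetical.

universe     = vanilla
executable   = whole_machine_app
log          = whole_machine_app.log
output       = whole_machine_app.out
error        = whole_machine_app.err

# advertise this as a whole-machine job for matchmaking
+RequiresWholeMachine = True

# only match machines that advertise a whole-machine slot
Requirements = (Target.CAN_RUN_WHOLE_MACHINE =?= True)

queue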

Then, the configuration of the execute machine contains the following. Make sure it either comes after the definitions of the attributes to which this policy appends (such as START), or that the definitions are merged.


# we will double-allocate resources to overlapping slots
NUM_CPUS = $(DETECTED_CORES)*2
MEMORY = $(DETECTED_MEMORY)*2

# single-core slots get 1 core each
SLOT_TYPE_1 = cpus=1
NUM_SLOTS_TYPE_1 = $(DETECTED_CORES)

# whole-machine slot gets as many cores and RAM as the machine has
SLOT_TYPE_2 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY)
NUM_SLOTS_TYPE_2 = 1

# Macro specifying the slot id of the whole-machine slot
# Example: on an 8-core machine, the whole-machine slot is 9.
WHOLE_MACHINE_SLOT = ($(DETECTED_CORES)+1)

# ClassAd attribute that is True/False depending on whether this slot is
# the whole-machine slot
CAN_RUN_WHOLE_MACHINE = SlotID == $(WHOLE_MACHINE_SLOT)
STARTD_EXPRS = $(STARTD_EXPRS) CAN_RUN_WHOLE_MACHINE

# advertise state of each slot as SlotX_State in ClassAds of all other slots
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) State

# Macro for referencing state of the whole-machine slot.
# Relies on eval(), which was added in HTCondor 7.3.2.
WHOLE_MACHINE_SLOT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_State"))

# Macro that is true if any single-core slots are claimed
# WARNING: THERE MUST BE AN ENTRY FOR ALL SLOTS
# IN THE EXPRESSION BELOW.  If you have more slots, you must
# extend this expression to cover them.  If you have fewer
# slots, extra entries are harmless.
SINGLE_CORE_SLOTS_CLAIMED = \
  ($(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed") < \
  (Slot1_State =?= "Claimed") + \
  (Slot2_State =?= "Claimed") + \
  (Slot3_State =?= "Claimed") + \
  (Slot4_State =?= "Claimed") + \
  (Slot5_State =?= "Claimed") + \
  (Slot6_State =?= "Claimed") + \
  (Slot7_State =?= "Claimed") + \
  (Slot8_State =?= "Claimed") + \
  (Slot9_State =?= "Claimed") + \
  (Slot10_State =?= "Claimed") + \
  (Slot11_State =?= "Claimed") + \
  (Slot12_State =?= "Claimed") + \
  (Slot13_State =?= "Claimed") + \
  (Slot14_State =?= "Claimed") + \
  (Slot15_State =?= "Claimed") + \
  (Slot16_State =?= "Claimed") + \
  (Slot17_State =?= "Claimed") + \
  (Slot18_State =?= "Claimed") + \
  (Slot19_State =?= "Claimed") + \
  (Slot20_State =?= "Claimed") + \
  (Slot21_State =?= "Claimed") + \
  (Slot22_State =?= "Claimed") + \
  (Slot23_State =?= "Claimed") + \
  (Slot24_State =?= "Claimed") + \
  (Slot25_State =?= "Claimed") + \
  (Slot26_State =?= "Claimed") + \
  (Slot27_State =?= "Claimed") + \
  (Slot28_State =?= "Claimed") + \
  (Slot29_State =?= "Claimed") + \
  (Slot30_State =?= "Claimed") + \
  (Slot31_State =?= "Claimed") + \
  (Slot32_State =?= "Claimed") + \
  (Slot33_State =?= "Claimed")

# Single-core jobs must run on single-core slots
START_SINGLE_CORE_JOB = \
  TARGET.RequiresWholeMachine =!= True && MY.CAN_RUN_WHOLE_MACHINE == False && \
  $(WHOLE_MACHINE_SLOT_STATE) =!= "Claimed"

# Whole-machine jobs must run on the whole-machine slot
START_WHOLE_MACHINE_JOB = \
  TARGET.RequiresWholeMachine =?= True && MY.CAN_RUN_WHOLE_MACHINE

START = ($(START)) && ( \
  ($(START_SINGLE_CORE_JOB)) || \
  ($(START_WHOLE_MACHINE_JOB)) )

# Suspend the whole-machine job until single-core jobs finish.
SUSPEND = ($(SUSPEND)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE && ($(SINGLE_CORE_SLOTS_CLAIMED)) )

CONTINUE = ($(SUSPEND)) =!= True

WANT_SUSPEND = ($(WANT_SUSPEND)) || ($(SUSPEND))

# In case group-quotas are being used, trim down the size
# of the "pie" to avoid double-counting.
GROUP_DYNAMIC_MACH_CONSTRAINT = CAN_RUN_WHOLE_MACHINE == False
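
After adding this configuration and running condor_reconfig on the execute machine, a quick sanity check is to print the resulting expressions with condor_config_val; a minimal sketch, run on the execute machine itself:

$ condor_config_val START
$ condor_config_val SUSPEND
$ condor_config_val STARTD_SLOT_EXPRS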

Instead of suspending the whole-machine job until single-core jobs finish, we could instead suspend single-core jobs while whole-machine jobs run. Replacing the SUSPEND expression above with the following achieves this policy. Beware that this could cause single-core jobs to be suspended indefinitely if there is a steady supply of whole-machine jobs.

# Suspend single-core jobs while the whole-machine job runs
SUSPEND = ($(SUSPEND)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

Another possible policy is to suspend the whole-machine job while any single-core jobs are running, but to preempt the single-core jobs if they take too long to finish. Starting with the original example policy, this can be achieved by adding the following to the configuration:

PREEMPT = ($(PREEMPT)) || ( \
  MY.CAN_RUN_WHOLE_MACHINE =!= True && $(WHOLE_MACHINE_SLOT_STATE) =?= "Claimed" )

# When a job is preempted, let it run for up to a day before killing it
MaxJobRetirementTime = 24*3600

Avoiding Delayed Suspension

In the first example policy above, the whole-machine job is suspended while single-core jobs finish running. This suspension happens within POLLING_INTERVAL (default 5) seconds of the job start. For many applications, being suspended a few seconds after startup and then resuming later is not a problem. However, this may be problematic for some applications: for example, a job that is intended to measure how much real time an algorithm takes to run.

One solution to this problem is to have the job sleep for at least POLLING_INTERVAL seconds before doing anything that should not be interrupted. Rather than requiring the job to be modified, the administrator can use USER_JOB_WRAPPER to do the sleeping before the job starts.

To accomplish this, create the file whole_machine_job_wrapper with the following contents. Adjust the settings of condor_history and sleep_time at the top of the script to be appropriate for your situation. Be sure to give all users read and execute permission on this file. If you already have a USER_JOB_WRAPPER, you will need to merge this new script with the existing one.

#!/bin/sh

# Set this to the full path to condor_history:
condor_history=condor_history

# Set this to something greater than POLLING_INTERVAL (default 5)
sleep_time=10

if ! [ -x "${condor_history}" ]; then
  echo "USER_JOB_WRAPPER has incorrect path to condor_history" 2>&1
  exit 1
fi

if ! [ -f "${_CONDOR_MACHINE_AD}" ]; then
  echo "USER_JOB_WRAPPER cannot fine machine ad" 2>&1
  exit 1
fi

is_whole_machine_slot=`"${condor_history}" -f "${_CONDOR_MACHINE_AD}" -format "%d\n" CAN_RUN_WHOLE_MACHINE`

if [ "${is_whole_machine_slot}" = "1" ]; then
  # Sleep for at least POLLING_INTERVAL seconds (default 5) before
  # starting whole-machine jobs so that if we are going to get
  # suspended while the machine drains, we suspend here before
  # starting the actual job.
  sleep $sleep_time
fi

exec "$@"

The configuration then needs to be modified to point to the full path of the job wrapper script.

USER_JOB_WRAPPER = /path/to/whole_machine_job_wrapper

"Sticky" Whole-machine slots

When a whole-machine job is about to start running, it must first wait for all single-core jobs to finish. If this "draining" of the single-core slots takes a long time, and some slots finish long before others, the machine is not efficiently utilized. Therefore, to maximize throughput, it is desirable to avoid frequent need for draining. One way to achieve this is to allow new whole-machine jobs to start running as soon as the previous whole-machine job finishes, without giving single-core jobs an opportunity to start in the same scheduling cycle. The trade-off in making the whole-machine slots "sticky" is that single-core jobs may get starved for resources if whole-machine jobs keep the system in whole-machine mode. This can be addressed by limiting the total number of slots that support whole-machine mode.

HTCondor automatically starts new jobs from the same user as long as CLAIM_WORKLIFE has not expired for the user's claim to the slot. Therefore, the stickiness of the whole-machine slot for a single user can be controlled most easily with CLAIM_WORKLIFE. To achieve stickiness across multiple whole-machine users, a different approach is required. The following policy achieves this by preventing single-core jobs from starting for 10 minutes after the whole-machine slot changes state. Therefore, when a whole-machine job finishes, there should be ample time for new whole-machine jobs to match to the slot before any single-core jobs are allowed to start.

# advertise when slots make state transitions as SlotX_EnteredCurrentState
STARTD_SLOT_EXPRS = $(STARTD_SLOT_EXPRS) EnteredCurrentState

# Macro for referencing EnteredCurrentState of the whole-machine slot.
# Relies on eval(), which was added in HTCondor 7.3.2.
WHOLE_MACHINE_SLOT_ENTERED_CURRENT_STATE = \
  eval(strcat("Slot",$(WHOLE_MACHINE_SLOT),"_EnteredCurrentState"))

# The following expression uses LastHeardFrom rather than time()
# because the former is stable throughout a matchmaking cycle, whereas
# the latter changes from moment to moment and therefore leads to
# unexpected behavior.
START_SINGLE_CORE_JOB = $(START_SINGLE_CORE_JOB) && \
  ( isUndefined($(WHOLE_MACHINE_SLOT_ENTERED_CURRENT_STATE)) || \
    isUndefined(MY.LastHeardFrom) || \
    MY.LastHeardFrom-$(WHOLE_MACHINE_SLOT_ENTERED_CURRENT_STATE)>600 )
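
For the simpler single-user case mentioned above, stickiness can be tuned with CLAIM_WORKLIFE alone; a minimal sketch, with an illustrative value:

# keep a claim alive for up to two hours, so the same user's
# whole-machine jobs run back to back without renegotiation
CLAIM_WORKLIFE = 7200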

Accounting and Monitoring

The above policies rely on job suspension. Should the jobs be "charged" for the time they spend in a suspended state? This affects the user's fair-share priority and the accumulated number of hours reported by condor_userprio. As of HTCondor 7.4, the default behavior is to charge the jobs for time they spend in a suspended state. There is a configuration variable, NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES, that can be used to get the opposite behavior.
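
To get that opposite behavior, so that jobs are not charged for time their resources spend suspended, the setting would be:

# do not charge jobs for time their claimed resources spend suspended
NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES = True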

Should the whole-machine slot charge more than the single-core slots? The policy for this is determined by SlotWeight. By default, this is equal to the number of cores associated with the slot, so usage reported in condor_userprio will count the whole-machine slot on an 8-core machine as 8 times the usage reported for a single-core slot.
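
If every slot should instead be charged equally, regardless of core count, the slot weight can be overridden; a minimal sketch (this affects accounting for all slots defined by this configuration):

# charge every slot the same; the default is SLOT_WEIGHT = Cpus
SLOT_WEIGHT = 1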

There are a few things to keep in mind when interpreting the output of condor_status. The TotalCpus and TotalMemory attributes are double what the machine actually has, because we are double-counting these resources across the single-core and whole-machine slots.
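
One way to see this is to dump a slot ad and inspect those attributes; a sketch, where execute.node is a hypothetical machine name:

$ condor_status -long execute.node | grep -E '^Total(Cpus|Memory)'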

When looking at condor_status, the extra slot representing the whole machine is visible. Notice that it appears in the Owner state, because the START expression we have configured is false unless presented with a whole-machine job. We could modify the IsOwner expression to make the slot appear as Unclaimed rather than Owner, but we find that Owner is a nice reminder that this slot is special and is not available to run "normal" jobs. In fact, it turns out that transitions between the Owner and Unclaimed states have an additional important side-effect: they cause the machine to send an updated ad to the collector whenever the machine's willingness to run a job changes. Without this, it can take as long as UPDATE_INTERVAL for the collector to be updated.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@execute.node LINUX      X86_64 Unclaimed Idle     0.750  2006  0+00:00:04
slot2@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:05
slot3@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:06
slot4@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:07
slot5@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:08
slot6@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:09
slot7@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:10
slot8@execute.node LINUX      X86_64 Unclaimed Idle     0.000  2006  0+00:00:03
slot9@execute.node LINUX      X86_64 Owner     Idle     0.000  16054  0+00:00:08

To filter out the whole-machine slot, one could use a constraint such as the following:

condor_status -constraint 'CAN_RUN_WHOLE_MACHINE =!= True'

When the whole-machine slot is claimed, the other slots will appear in the Owner state, because they are configured to not start any new jobs while the whole-machine slot is claimed. Again, we could make them appear as Unclaimed by changing the IsOwner expression, but the Owner state serves as a useful reminder that these slots are reserved during this time, and it has the side-effect of forcing timely updates of the slot ad to the collector.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@execute.node LINUX      X86_64 Owner     Idle     0.870  2006  0+00:00:04
slot2@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:05
slot3@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:06
slot4@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:07
slot5@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:08
slot6@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:09
slot7@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:10
slot8@execute.node LINUX      X86_64 Owner     Idle     0.000  2006  0+00:00:03
slot9@execute.node LINUX      X86_64 Claimed   Busy     0.000  16054  0+00:00:04

Since the whole-machine policy depends on STARTD_SLOT_EXPRS, it can take a few iterations of ClassAd updates for the state of the slots to converge. For this reason, the information visible in condor_status may lag behind the true state of the slots when conditions have recently changed. It is therefore useful to use the -direct option to condor_status when verifying that the whole-machine policy is working correctly. As noted above, with IsOwner set in the default way, so that slots enter the Owner state when they are unwilling to run new jobs, the information published in the collector will be updated much more quickly than if IsOwner is set to false. It is therefore highly recommended that IsOwner not be set to false.
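
For example, to query the condor_startd on an execute machine directly rather than through the collector (the host name is hypothetical):

$ condor_status -direct execute.node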