How to bind multi-core Condor jobs to only run in one CPU socket

Known to work with HTCondor version: 8.2.8

Some multicore jobs are very sensitive to CPU cache performance, and have huge slowdowns if threads run on cores located on separate physical CPU cores on the same machine.

This slowdown can be so severe that some sites would rather have multi-core jobs sit idle rather than be scheduled to run across physical cores.

Condor can do this with static slots in a straight forward way. For partitionable slots, one can create one p-slot per physical cpu socket, and affinity lock the set of cores on those socket to those slots. Assume a two socket machine, with two cores per socket. Socket one has core ids 0 and 1, Socket two has core ids 2 and 3.

# Assume two sockets per machine
NUM_SLOTS_TYPE_1 = 2
SLOT_TYPE_1_PARTITIONABLE = true

# This example has 2 cores per socket
SLOT_TYPE_1 = cpus=2

# Turn on CPU affinity in condor
ENFORCE_CPU_AFFINITY = true

# Bind the cores to each slot -- note that you must figure out how your particular cpu numbers their cores and sockets.  I don't think this is standard, and I'm not sure how hyperthreads (if enabled) count.
SLOT1_CPU_AFFINITY=0,1
SLOT2_CPU_AFFINITY=2,3

To determine which core ids are bound to which sockets, the following perl script can be run.


$ perl -lne '
  $ARGV =~ m{/cpu(\d+)/};
  push @{$p{$_}}, $1;
  END {print "socket $_: ", join ",", sort {$a<=>$b} @{$p{$_}} for sort {$a<=>$b} keys %p}
' /sys/devices/system/cpu/cpu*/topology/physical_package_id

Whose output might look like:


socket 0: 0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23
socket 1: 8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31