Ticket #2440: HGQ improperly handling partitionable slots

The HGQ quota assignment does not properly account for partitionable slots with multiple cpus. When accept-surplus is disabled, this results in slot starvation, because the quota calculation counts a partitionable slot once instead of counting the cpus it has available.
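
To illustrate the accounting difference, here is a minimal Python sketch of the pool-size computation. It is a stand-in for the negotiator's logic rather than actual HTCondor source, and the slot dictionaries and attribute names are assumptions for the example:

def effective_pool_size(slots):
    total = 0
    for slot in slots:
        if slot.get("PartitionableSlot", False):
            # Fixed behavior: a partitionable slot contributes one
            # unit per available cpu.
            total += slot["Cpus"]
        else:
            # Static slots count once each.
            total += 1
    return total

# The broken behavior was effectively "total += 1" for every slot, so
# a 4-cpu partitionable slot counted the same as a 1-cpu static slot.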

From Jon Thomas:

GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5

Simple case:

NUM_CPUS = 4
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 2

Submit jobs to a. Users in a will not use more than one slot.
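
A worked reading of this simple case (inferred from the report, not from negotiator logs): the broken accounting counts the pool as 2 slots rather than 4 cpus, so group a's dynamic quota of 0.5 comes out to 0.5 x 2 = 1 slot, instead of the intended 0.5 x 4 = 2 cpus.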

Seems to hit in other configs too:

NUM_CPUS = 20
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 10

Will only use 9 slots.

NUM_CPUS = 20
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5

Will only use 4 slots.
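
The same reading applies to these variants (again inferred rather than measured): the pool is counted as 10 or 5 partitionable slots instead of 20 cpus, so each group's quota is computed against a pool much smaller than the actual cpu capacity.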

Remarks:

2011-Sep-03 13:44:34 by eje:
repro/test:

Using the following configuration, with both regular and partitionable slots:

NEGOTIATOR_DEBUG = D_FULLDEBUG
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
GROUP_QUOTA_MAX_ALLOCATION_ROUNDS = 1

GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_ACCEPT_SURPLUS = FALSE

NUM_CPUS = 22
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5

SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 2

The unused slot profile should look like this:

$ svhist Machine _SlotType_ State Activity Cpus
      5 rorschach.localdomain | P | Unclaimed | Idle | 4
      2 rorschach.localdomain | X | Unclaimed | Idle | 1
      7 total
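
Quota arithmetic for this pool, taken from the config above: the pool should total 5 x 4 + 2 x 1 = 22 cpus, so group "a"'s dynamic quota of 0.5 works out to 0.5 x 22 = 11 slots.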

Submit 12 jobs to group "a". The group should be allowed to run 11 jobs:

universe = vanilla
cmd = /bin/sleep
args = 1800
should_transfer_files = if_needed
when_to_transfer_output = on_exit
+AccountingGroup="a.user"
queue 12

Submit the above jsub file and let several negotiator cycles pass. Before the fix, the cpus provided by partitionable slots are not properly accounted for and slots are starved; group "a" is allowed to run only 6 jobs instead of its quota of 11:

$ qvhist AccountingGroup JobStatus
      6 a.user | 1
      6 a.user | 2
     12 total

After the fix, group "a"'s quota is filled and 11 jobs are allowed to run:

$ qvhist AccountingGroup JobStatus
      1 a.user | 1
     11 a.user | 2
     12 total


Here is an additional test variation for the fixed version that exercises the code path for GROUP_DYNAMIC_MACH_CONSTRAINT / NEGOTIATOR_SLOT_POOLSIZE_CONSTRAINT.

This configuration adds 42 cpus in a single partitionable slot, which will be ignored both by the HGQ algorithms and by the job submission:

NEGOTIATOR_DEBUG = D_FULLDEBUG
SCHEDD_INTERVAL	= 15
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
GROUP_QUOTA_MAX_ALLOCATION_ROUNDS = 1

GROUP_NAMES = a, b
GROUP_QUOTA_DYNAMIC_a = 0.5
GROUP_QUOTA_DYNAMIC_b = 0.5
GROUP_ACCEPT_SURPLUS = FALSE

NUM_CPUS = 22+42

SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 5

SLOT_TYPE_2 = cpus=1
SLOT_TYPE_2_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 2

# I intend this slot to be ignored for quota counting by
# GROUP_DYNAMIC_MACH_CONSTRAINT / NEGOTIATOR_SLOT_POOLSIZE_CONSTRAINT
SLOT_TYPE_3 = cpus=42
SLOT_TYPE_3_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_3 = 1

# ignore slots of type 3 for quota counting
GROUP_DYNAMIC_MACH_CONSTRAINT = (Cpus < 42)
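
For illustration, here is a minimal Python sketch of the intended pool counting under this constraint (the slot dictionaries and filtering predicate are assumptions for the example, not HTCondor source):

def constrained_pool_size(slots, constraint):
    # Sum the pool only over slots passing the constraint; partitionable
    # slots contribute their cpu count, static slots count once.
    return sum(
        slot["Cpus"] if slot.get("PartitionableSlot", False) else 1
        for slot in slots
        if constraint(slot)
    )

slots = (
    [{"PartitionableSlot": True, "Cpus": 4}] * 5      # five SLOT_TYPE_1
    + [{"PartitionableSlot": False, "Cpus": 1}] * 2   # two SLOT_TYPE_2
    + [{"PartitionableSlot": True, "Cpus": 42}]       # one SLOT_TYPE_3
)
print(constrained_pool_size(slots, lambda s: s["Cpus"] < 42))  # prints 22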

This version of the job also ignores the 42-cpu slot, so it does not get partitioned:

universe = vanilla
cmd = /bin/sleep
args = 1800
should_transfer_files = if_needed
when_to_transfer_output = on_exit
requirements = (Cpus < 42)
+AccountingGroup="a.user"
queue 12

Because the 42-cpu slot is ignored by both the HGQ algorithms and the job, we should see that group "a" is still allowed to run 11 jobs, as before:

$ condor_submit t.jsub
Submitting job(s)............
12 job(s) submitted to cluster 1.

$ qvhist AccountingGroup JobStatus
      1 a.user | 1
     11 a.user | 2
     12 total


2011-Sep-23 17:38:26 by danb:
Code review:

This patch conflicts quite a bit with a patch TJ recently committed on the master branch: #2277. You might have beaten him to the punch if I hadn't taken so long to review. Sorry!

The basic idea looks okay to me, though one wonders if it would be better to just simplify the negotiator by always using slot weights.

--Dan


2011-Sep-23 18:10:04 by eje:
For this particular situation, I don't think slot weights would work quite the right way, since in the case of partitionable slots you want a job to be able to peel off a single slot without paying for a weight tied to the number of cpus.

I think we want partitionable slots to be "weighted" on the resource side (summing up available resources for allocation) but not on the consumption side (paying for weight when negotiating).
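
A hypothetical Python sketch of that distinction (schematic only, not proposed code):

def pool_contribution(slot):
    # Resource side: count every cpu a partitionable slot can provide
    # when summing the pool available for quota allocation.
    return slot["Cpus"] if slot.get("PartitionableSlot", False) else 1

def match_cost(slot):
    # Consumption side: a job peels off one dynamic slot, so it pays 1
    # regardless of the parent slot's total cpu count.
    return 1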

(I expected this might be a race condition with #2277, regardless)


Properties:

Type: defect                Last Change: 2011-Dec-07 15:56
Status: resolved            Created: 2011-Sep-02 18:07
Fixed Version: v070604      Broken Version:
Priority:                   Subsystem: Daemons
Assigned To: eje            Derived From:
Creator: eje                Rust:
Customer Group: other       Visibility: public
Notify: tstclair@redhat.com, jthomas@redhat.com, dan@hep.wisc.edu, eje@cs.wisc.edu
Due Date:

Related Check-ins:

2011-Oct-04 10:05   Check-in [27607]: Improve version history item, which required adding index entries. ===GT=== #2440 (By Karen Miller )
2011-Sep-28 17:24   Check-in [27459]: Merged [27447], [27448], [27458], Merge V7_7_2-branch to master ===GT:Fixed=== #2440 (By Erik Erlandson )
2011-Sep-28 16:47   Check-in [27458]: Merged [27456], [27457], Merge branch V7_6-branch into V7_7_2-branch ===GT:Fixed=== #2440 (By Erik Erlandson )
2011-Sep-28 16:18   Check-in [27457]: ===VersionHistory:Completed=== ===GT=== #2440 (By Erik Erlandson )
2011-Sep-28 16:13   Check-in [27456]: Fix group quota assignment to correctly account for the effective number of slots provided by partitionable slots ===GT:Fixed=== #2440 (By Erik Erlandson )

Attachments: