Ticket #3435: negotiator-side resource consumption with consumption policies

I've put together a draft implementation of two interlocking new capabilities:

The draft currently lives on this topic branch on my github fork:

I wrote up a discussion and demonstration of the proposed features here:

Motivation

Some advantages of negotiator-side consumption:

Configurable resource consumption policies:

Requirements

resource consumption in matchmaking:

consumption policies:

Proposal

Consumption policies

negotiator-side resource consumption

Testing

Testing needs to cover the following categories of behavior (a policy sketch exercising category 1 appears after the list):

  1. respect for configured consumption policies
    1. validate that resources are deducted from parent resources (p-slots)
    2. granularity of resources allocated to claims (d-slots)
    3. any default policies, if we go that way
  2. negotiator-side resource consumption
    1. achieving multiple matches per resource during a negotiation cycle
    2. correctness of resource management
    3. resource limits as a matching constraint
  3. compatibility with legacy/mixed-pool resources
    1. static slots
    2. legacy p-slots
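
For example, a minimal sketch of a policy exercising d-slot granularity (category 1) might use a list-valued quantize(), along these lines -- the tier values here are illustrative assumptions, not proposed defaults:

# hypothetical granularity test: enable a consumption policy on the startd
CONSUMPTION_POLICY = True
# "target" refers to the job ad; allocate memory to d-slots in discrete tiers
CONSUMPTION_MEMORY = quantize(target.RequestMemory, {256, 512, 1024})
# consume whole cpus per match
CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})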

Effort


Remarks:


2013-Mar-05 10:58:27 by johnkn:
Bulk change of target version from v070904 to v070905 using ./ticket-target-mover.


2013-Apr-01 12:01:58 by tannenba:
I think we should merge this work into master near the start of the next developer series (aka v8.1.x) assuming the below conditions are met [discussed these items w/ both Miron and Matt]:

  1. A little more explanation work is needed on the design doc. While Erik did write a design doc for this work (thank you!), it is still somewhat hand-wavy. For example, even I do not fully understand the point of all of the "motivation" bullets; this is bad. I attached it below as "erik_negotiator_plsot_spitting.docx".

  2. We convince ourselves there is a single knob to fully disable this execution path and have the code operate as it did before.

  3. We convince ourselves that this mechanism does not inhibit our longer term strategic plan to empower the schedd. (I think we are already good here, but we should think it through to be certain...).

  4. Usability - some concerns over the number of knobs involved; we should make sure there are good defaults such that users do not have to consult the manual or do configuration in the typical case(s).


2013-Apr-01 15:03:43 by matt:
With inclusion in devel.next, it is critical that the semantics not be changed beyond the defined tests w/o explicit agreement.

Todd, which motivation bullets are not fully understood?


2013-Apr-01 18:50:28 by eje:
Regarding (2), the feature is "opt-in" -- so legacy behavior holds unless a startd is deliberately configured with a consumption policy. The latest draft passed batlab, which I take as system-test verification; however, it can also be verified during code review.
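
As a sketch, the opt-in knob looks something like the following (assuming the global default is False, which is my reading of the opt-in behavior described above):

# default: legacy behavior, consumption policies disabled
CONSUMPTION_POLICY = False
# deliberately opt one startd in (via its local config name, as in the test below)
STARTD.ST2.CONSUMPTION_POLICY = True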


2013-Apr-01 18:59:05 by eje:
Regarding knobs and defaults, all of the new knobs have defaults (the default values are described in the 'Proposal' section above), so the feature should be easy to configure.
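
For flavor, a sketch of what per-resource defaults modeled on the existing modify-request-exprs behavior could look like (the exact expressions here are assumptions; the 'Proposal' section holds the actual values):

# "target" is the job ad; each match consumes at least what the job requests
CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})
CONSUMPTION_MEMORY = quantize(target.RequestMemory, {128})
CONSUMPTION_DISK = quantize(target.RequestDisk, {1024})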

Regarding future schedd roadmapping, there is nothing here that would prevent future schedd functionality. In fact, the consumption policy logic might be tweaked in the future to allow a policy option like 'send entire p-slot to schedd' (and/or just use the legacy logic for that purpose).


2013-Apr-11 12:01:25 by gthain:
The current negotiator hierarchical group quota code doesn't work well with both accept_surplus and weighted slots. It seems like this patch doesn't address that problem.


2013-Apr-11 19:08:31 by eje:
The reason surplus sharing doesn't play well with generalized slot weights is that it has to draw an equivalence between jobs and 'weights' (at least, in its current incarnation).

A way to solve that unit-analysis problem is to define slot weights so that they represent units of 'jobs able to be run against the slot':

http://erikerlandson.github.io/blog/2012/11/15/rethinking-the-semantics-of-group-quotas-and-slot-weights-claim-capacity-model/

So, this ticket "solves" the problem by enabling a consumption policy to be paired with a slot weight in such a way as to make weights equal jobs, which makes it possible to ensure that the surplus-sharing logic plays properly with slot weights.

Really, this is just a generalization of the logic that was already present for counting Cpus:

int count_effective_slots(ClassAdListDoesNotDeleteAds& startdAds, ExprTree* constraint)
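
As a concrete sketch of that pairing (the floor() expression is my illustration, keyed to the 2-cpu policy used in the test below, not a prescribed recipe):

# each match consumes a fixed 2-cpu bundle...
CONSUMPTION_POLICY = True
CONSUMPTION_CPUS = 2
# ...so the slot weight counts the jobs still runnable against the slot
SlotWeight = floor(Cpus / 2)

With this pairing, every match decrements the weight by exactly one, so surplus-sharing arithmetic in units of weight is also arithmetic in units of jobs.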


2013-Apr-11 19:20:54 by eje:
I had a request to verify that consumption policies can operate when schedd-side "leftover" slot splitting is also enabled.

My initial test was to create a consumption policy that granted the entire slot. My original theory was that the schedd could split that; however, I had forgotten that I designed consumption policies to operate consistently across the negotiator and startd. So the initial negotiator match obtains its claim, which represents all of the resource assets. The schedd has no problem attempting to further allocate leftovers; however, there are no leftovers, as the first job acquires them all with its claim.

The first test did demonstrate that nothing pathological happens. It just means that the negotiator will tend to acquire the matches/claim first, and the schedd will generally have nothing further to do in that case.
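
For reference, a whole-slot policy like the one in that first test might look like this sketch (the TotalSlot* attribute names are my assumption, since that test's config is not shown here):

# hypothetical policy: each match consumes the entire slot
CONSUMPTION_POLICY = True
CONSUMPTION_CPUS = TotalSlotCpus
CONSUMPTION_MEMORY = TotalSlotMemory
CONSUMPTION_DISK = TotalSlotDisk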

However, I did do a further test where I declared two startds. Each startd has a p-slot, but only one has a consumption policy, and the other operates in traditional mode. This test demonstrates that the system can operate in "mixed pool" mode, where some startds advertise traditional slots, and others can advertise slots with consumption policies, and the system correctly handles both.

The configuration was:

# spoof some cores
NUM_CPUS = 5

STARTD.ST1.STARTD_LOG = $(LOG)/Startd_1_Log
STARTD.ST1.STARTD_NAME = st1
STARTD.ST1.ADDRESS_FILE = $(LOG)/.startd_1_address
STARTD_ST1_ARGS = -f -local-name ST1
STARTD_ST1 = $(STARTD)

STARTD.ST2.STARTD_LOG = $(LOG)/Startd_2_Log
STARTD.ST2.STARTD_NAME = st2
STARTD.ST2.ADDRESS_FILE = $(LOG)/.startd_2_address
STARTD_ST2_ARGS = -f -local-name ST2
STARTD_ST2 = $(STARTD)

USE_PROCD = FALSE
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD_ST1, STARTD_ST2

# configure an aggregate resource (p-slot) to consume
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
# declare multiple claims for negotiator to use
# may also use global: NUM_CLAIMS
SLOT_TYPE_1_NUM_CLAIMS = 20
NUM_SLOTS_TYPE_1 = 1

# turn on schedd-side claim splitting to test with a consumption policy
CLAIM_PARTITIONABLE_LEFTOVERS = True

# turn this off to demonstrate that consumption policy will handle this kind of logic
MUST_MODIFY_REQUEST_EXPRS = False

# configure a consumption policy.   This policy is modeled on
# current 'modify-request-exprs' defaults:
# "my" is resource ad, "target" is job ad
# startd-wide consumption policy config
# defaults for cpus/memory/disk consumption
STARTD.ST2.CONSUMPTION_POLICY = True

# a consumption policy where each match consumes a fixed bundle of resources
STARTD.ST2.CONSUMPTION_CPUS = 2
STARTD.ST2.CONSUMPTION_MEMORY = 32
STARTD.ST2.CONSUMPTION_DISK = 128

# keep slot weights enabled for match costing
NEGOTIATOR_USE_SLOT_WEIGHTS = True

# weight used to derive match cost: W(before-consumption) - W(after-consumption)
SlotWeight = Cpus

# for simplicity, turn off preemption, caching, worklife
CLAIM_WORKLIFE=0
MAXJOBRETIREMENTTIME = 3600
PREEMPT = False
RANK = 0
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
NEGOTIATOR_MATCHLIST_CACHING = False

# verbose logging
ALL_DEBUG = D_FULLDEBUG

NEGOTIATOR_INTERVAL = 300
SCHEDD_INTERVAL	= 15

Spin up the pool and verify that there are two p-slots:

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@st1@localhos LINUX      X86_64 Unclaimed Idle      0.500 1897  0+00:00:04
slot1@st2@localhos LINUX      X86_64 Unclaimed Idle      0.460 1897  0+00:00:04
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       0         2       0          0        0

               Total     2     0       0         2       0          0        0

Now submit 10 jobs:

universe = vanilla
cmd = /bin/sleep
args = 300
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 10

You should see three match-events in the negotiator:

$ grep Matched NegotiatorLog
04/11/13 16:57:36       Matched 1.0 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:50396> slot1@st1@localhost
04/11/13 16:57:36       Matched 1.1 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:54095> slot1@st2@localhost
04/11/13 16:57:36       Matched 1.2 none.user0000@localdomain <10.0.1.3:52463> preempting none <10.0.1.3:54095> slot1@st2@localhost

Notice that only one negotiator match can occur against the "traditional" p-slot slot1@st1@localhost, but two matches are allowed against slot1@st2@localhost, since its consumption policy consumes 2 cpus per match and it has 5 cpus (two matches consume 4 of the 5, leaving too few for a third).

However, the schedd can use leftover-splitting on the traditional p-slot slot1@st1@localhost, so the end result is 7 total jobs running: five on slot1@st1 and two on slot1@st2:

[eje@rorschach grid]$ cchist condor_q RemoteHost
      5 slot1@st1@localhost
      2 slot1@st2@localhost
      3 undefined
     10 total

2013-Apr-22 11:40:52 by johnkn:
Bulk change of target version from v070905 to v070906 using ./ticket-target-mover.


2013-Jun-14 11:18:04 by tannenba:
The merge of this code into master is pending eje having time to work on the rest of the outstanding items in the design document, most notably regression tests and documentation. I can help with this, but the bulk of the effort will have to come from the contributor (eje).

Re regression tests, I think we should start by listing out five or so user scenarios that are anticipated. What user scenarios does this patch hopefully enable that the user could not do before? Off the top of my head (not saying these are ones we want, but to give the flavor):

Before documentation, we should think hard about terminology to prevent terminology bloat and other confusion. That is, for user documentation purposes, do we really want to introduce a new term/concept ('resource consumption'), or should we describe it in broader strokes using already-established terminology such as 'negotiator p-slot awareness'?


2013-Jun-17 14:25:27 by tannenba:
IMO, this code was merged a bit prematurely (not sure why) from V8_X-gittrac_3681-branch to the master branch. It was targeted for v8.1.2 on purpose (as opposed to v8.1.0 or v8.1.1) to allow time for code review and regression-test work as per the design. It looks like it was merged under the assumption that it should be released in v8.1.0.


2013-Jul-17 09:53:03 by johnkn:
Commit [36576] was causing frequent crashes of the STARTD in CHTC, so it was reverted from the V8_1_0-branch.

The crash appears to be caused by late references to deleted Claim objects via the r_claims member. Below is a typical stack trace for the crash, although it does not always crash at this exact point.

/usr/lib64/condor/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x12d)[0x7f71b93a401d]
/usr/lib64/condor/libcondor_utils_8_1_0.so(+0x130992)[0x7f71b93b0992]
/lib64/libpthread.so.0[0x31d440f500]
/lib64/libc.so.6[0x31d413259f]
condor_startd(Resource::publish_private(compat_classad::ClassAd*)+0xdb)[0x4480fb]
condor_startd(Resource::publish_for_update(compat_classad::ClassAd*, compat_classad::ClassAd*)+0x44)[0x44b744]
condor_startd(Resource::do_update()+0x3f)[0x44ba9f]
/usr/lib64/condor/libcondor_utils_8_1_0.so(TimerManager::Timeout(int*, double*)+0x1a1)[0x7f71b94c2071]
/usr/lib64/condor/libcondor_utils_8_1_0.so(DaemonCore::Driver()+0x751)[0x7f71b94d7e41]
/usr/lib64/condor/libcondor_utils_8_1_0.so(dc_main(int, char**)+0x1234)[0x7f71b94c5ed4]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x31d401ecdd]

The design of the r_claims member is the problem here. It should not be a set of Claims, one of which is also pointed to by the other members of the Resource. A Claim is a heavy data structure, but the only thing needed here is a set of claim IDs that haven't yet been inserted into a Claim object. A StringList would be the appropriate data structure for this purpose.


2013-Jul-27 17:07:32 by eje:
new topic branch including fix for #3792 is: V8_1-gt3792-startd-crashes-fix

Also includes a tweak to the accountant logic that enables consumption policies to work smoothly with both accounting groups and concurrency limits, plus a new regression test covering concurrency-limit and accounting-group functionality under consumption policies.
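
As a sketch of the kind of configuration that regression test exercises (the group and limit names here are hypothetical):

# an accounting group with a quota
GROUP_NAMES = group_a
GROUP_QUOTA_group_a = 10
# a concurrency limit; jobs opt in with 'concurrency_limits = biglim' in the submit file
BIGLIM_LIMIT = 5
# consumption policies enabled alongside both
CONSUMPTION_POLICY = True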


2013-Aug-09 12:25:46 by johnkn:
REVIEW: The memory use after free problem (#3792) appears to be fixed, but in a way that is more wasteful and more fragile than recommended.

The r_claims member of the Resource object is still a set of Claim() objects rather than a StringList of claim ids. This is bad because a Claim is a heavy object, but its use here is merely as a holder for a ClaimId.id() for publication/comparison and later insertion into the current Claim object. Use of Claim here is dangerous because methods in the Claim object can call get_by_any_id() in the ResMgr object, which can delete a Claim object depending on what ID is passed. This is not how you were told to fix the problem, and your alternative solution is more code, more memory usage, and less stable than what was recommended.

Changes to slot_builder.cpp and resattributes.h serve no purpose that I can determine.

The change of ClaimIdHash from a HashTable to a map of sets is also gratuitous. It would be better to leave the type of ClaimIdHash alone and allow the right-hand side to be the same as the contents of ATTR_CLAIM_ID_LIST. Many of the changes in the matchmaker exist only to account for the unnecessary change of the type of ClaimIdHash. This would also make the changes to the matchmaker a bit more localized.


2013-Aug-13 16:20:38 by eje:
summary of some email discussion:

>    +The r_claims member of the Resource object it still a set of Claim()
>    objects rather than a =StringList= of claim ids. This is bad because a
>    Claim is a heavy object. But it's use here is merely used as a holder for
>    a ClaimId.id() for publication/comparison and later insertion into the
>    current Claim object.

Although claims are "heavy", they aren't enormous (my estimate was a few kB), and there aren't a very large number of them. The number of claims defaults to the number of cores, and in general represents the maximum number of jobs expected to be running on a machine. If #3435 isn't enabled, r_claims is empty. My chief motivation for doing it the way I did was that it works using the current conventions and machinery -- that is, a slot stores its claim structure(s), and an incoming claim request initiates a ResMgr search over {slots}X{claims} for the one that matches the incoming request. So doing it this way only required a localized generalization to (optionally) search an additional source of claims: r_claims. The fix on #3792 just cleaned up the logic to make sure that the same claim doesn't reside on both r_claims and r_cur simultaneously. Although this can induce a situation where claims get constructed/destructed without ever being used, it doesn't happen frequently, or in a tight inner loop. The impact is small, and only occurs if one wants to use the new consumption policy feature.

>    +changes to slot_builder.cpp and resattributes.h serve no purpose that I
>    can determine.

The purpose of those mods was to enable slots constructed with zero-valued resource assets. It's a feature of consumption policies that, since different resources can drive the policy depending on configuration, other resources may be zero and never be consumed. (This property is also exercised in the #3435 regression tests.)

>    +The change of =ClaimIdHash= from a =HashTable= to a map of sets is also
>    gratituous.

To support multiple matches against a single slot, it was necessary to generalize the rhs of the mapping to contain multiple claims. Since I had to modify the structure definition, I just took the opportunity to migrate to a standard STL container while I was doing it.

> It's the CURRENT claim object that is being
> destroyed there.   The state machine that is the STARTD is very complex,
> so it's important to write code that doesn't make too many assumptions
> about calling order.

The reason this works cleanly is that for p-slots, r_cur doesn't actually represent a live claim. Instead, r_cur is handed off to a d-slot, and the d-slot is the actual holder of that claim. Then r_cur is replaced with a new claim-structure on the p-slot, in preparation for the next claim request. Effectively, I just added another bucket of claims that a p-slot can hand off to the d-slot.


2013-Oct-10 17:42:17 by tannenba:
Squash commit into master for v8.1.2. Documentation still needs work (no version history entry, and nowhere does it explain what Consumption Policies actually are or how/why you'd want to use them, etc.), so moving to docpending.


2013-Oct-22 11:22:51 by eje:
Here is a draft of Consumption Policies exposition: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ConsumptionPolicies

Properties:

Type: enhance           Last Change: 2013-Oct-31 15:14
Status: resolved          Created: 2013-Jan-11 13:36
Fixed Version: v080102           Broken Version: v070900 
Priority:          Subsystem: DaemonsCM 
Assigned To: smoler           Derived From: #2826
Creator: eje  Rust:  
Customer Group: other  Visibility: public 
Notify: matt@cs.wisc.edu, tstclair@redhat.com, eje@cs.wisc.edu, tannenba@cs.wisc.edu, gthain@cs.wisc.edu, dan@hep.wisc.edu  Due Date:  

Derived Tickets:

#3792   Claim handling for consumption policies causes crash in startd

Related Check-ins:

2013-Oct-29 13:58   Check-in [38013]: minimal documentation for new consumption policy feature #3435 (By Karen Miller )
2013-Oct-10 17:40   Check-in [37815]: Add support for Consumption Policies, aka negotiator slot splits. #3435 Squashed commit of the following: [...] (By Erik Erlandson )
2013-Oct-03 14:52   Check-in [37758]: Disable classad caching on consumption policy unit and system tests to workaround interaction between caching and calls to quantize() where 2nd arg is a list #3435 (By Erik Erlandson )
2013-Oct-01 16:43   Check-in [37740]: better consumption policy expressions for static slot emulation (#3435) (By Erik Erlandson )
2013-Oct-01 16:24   Check-in [37739]: Consumption policy based matches fail if CP expressions do not properly evaluate to a non-negative value, or if all CP expressions evaluate to zero (no resources of any kind consumed) (#3435) (By Erik Erlandson )
2013-Oct-01 16:23   Check-in [37738]: Fix for logic errs caused by optimization of job ad Requirements expr (#3435) (By Erik Erlandson )
2013-Oct-01 16:22   Check-in [37737]: Fix logic error in cp_override_requested() that could create RequestXXX attributes which did not originally exist (#3435) Conflicts: src/condor_utils/consumption_policy.cpp (By Erik Erlandson )
2013-Jul-27 16:35   Check-in [37026]: Support consumption policies and multiple matches against partitionable slots ===GT:Fixed=== #3435 #3792 (By Erik Erlandson )
2013-Jun-14 10:31   Check-in [36576]: Support for Consumption Policies and Negotiator Resource Consumption #3435 (By Erik Erlandson )