HTCondorWiki: Hgq Design Doc

Group Quota Design

Motivating Scenarios

??? What are some good use cases here? What didn't the old code do that the new code can?

Some questions we'd like customer use cases to address:

What is the semantic of accounting group quota?
- That is: what does a group quota regulate/limit?
- What is the 'unit' associated with a quota?
- http://erikerlandson.github.com/blog/2012/11/15/rethinking-the-semantics-of-group-quotas-and-slot-weights-claim-capacity-model/
What does it mean for groups to be in a hierarchy?
- How does a parent's quota relate to child quotas?
- How do 'sibling' groups relate to each other, their parent, and their children (if any)?

High Level Design and Definitions

The HGQ design is intended to allow an administrator to restrict the aggregate number of slots running jobs submitted by groups of users.

These sets of users are organized into hierarchical groups, with "none" being the name of the root group. The admin is expected to assign a quota to every leaf and interior node in the tree, except for the root. An assigned quota can be an absolute number or a floating point number from 0 to 1, which represents a fraction of the immediate parent's quota. If absolute, it represents a weighted number of slots, where each slot is multiplied by a configurable weight, which defaults to the number of cores. All named groups must be predeclared in the config file. Note that the quota is independent of user priority.

Definitions

Can we get crisp definitions of each of the fields in the GroupEntry structure?

Here is some annotation from the meeting on fields that didn't already have in-code doc:

    // these are set from configuration
    string name;
    double config_quota;  // could be static (>=1) or dynamic (0<x<1)
    bool static_quota;    // flag: true if config_quota is static, false if dynamic
    bool accept_surplus;  // true if this group will accept surplus
    bool autoregroup;     // true if this group will participate in the autoregroup phase

    // current usage information coming into this negotiation cycle
    double usage;         // accountant's value for usage under this group
    ClassAdListDoesNotDeleteAds* submitterAds;  // list of submitter ads under this group
    double priority;      // group's priority from the accountant

Meaning of quota for "static" quota:

The static quota for a given group indicates the minimum number of machines/slots that group is expected to be allocated, given sufficient demand. The sum of the static quotas of all the child nodes of any given parent must be less than or equal to the parent's static quota.

The sum of the children's static quotas may be less than the parent's quota. If so, the remainder is assigned to the parent.
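The static quota rule can be illustrated with a small sketch. This is not HTCondor code: `parent_remainder` is a hypothetical helper, and the `-1.0` return value for an oversubscribed parent is an assumption made for the example.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of HTCondor): given a parent's static quota
// and the static quotas of its children, return the remainder that stays
// with the parent. Returns -1.0 if the children oversubscribe the parent.
double parent_remainder(double parent_quota, const std::vector<double>& child_quotas) {
    assert(parent_quota >= 0.0);
    double child_sum = std::accumulate(child_quotas.begin(), child_quotas.end(), 0.0);
    if (child_sum > parent_quota) {
        return -1.0;  // invalid: children's static quotas exceed the parent's
    }
    return parent_quota - child_sum;  // this part is assigned to the parent itself
}
```

For a parent with static quota 100 and children with 40 and 30, the sketch leaves 30 with the parent.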

For dynamic (proportional) quota

A dynamic (proportional) quota indicates the fraction of its parent node's resources that the group is expected to be allocated, given sufficient demand. If the children of a node have proportional quotas, each child is then assigned an absolute quota computed as its proportion of the parent node's quota.

The sum of all the sibling quotas should be <= 1.0. If it is not, they are normalized to sum to 1, with a warning message.
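A minimal sketch of the dynamic quota conversion, under the assumption that normalization simply divides each fraction by the sibling sum. The function name `dynamic_to_absolute` and the warning text are illustrative, not taken from the HTCondor source.

```cpp
#include <cassert>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical helper: convert sibling dynamic quotas (fractions of the
// parent) into absolute quotas. If the fractions sum to more than 1.0 they
// are normalized first and a warning is printed.
std::vector<double> dynamic_to_absolute(double parent_quota, std::vector<double> fractions) {
    assert(parent_quota >= 0.0);
    double sum = std::accumulate(fractions.begin(), fractions.end(), 0.0);
    if (sum > 1.0) {
        std::fprintf(stderr, "warning: sibling dynamic quotas sum to %g, normalizing to 1.0\n", sum);
        for (double& f : fractions) f /= sum;
    }
    for (double& f : fractions) f *= parent_quota;  // fraction -> absolute quota
    return fractions;
}
```

With a parent quota of 100, fractions 0.5 and 0.25 become 50 and 25, and the remaining 25 stays unassigned among the children.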

Specifying Quotas

Each job then specifies which group it should be in with the +AccountingGroup = "group_name.username" syntax.
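As an illustration of the "group_name.username" form, the sketch below splits such a value at the last dot. This is an assumption made for the example: `split_accounting_group` is a hypothetical helper, and the real negotiator may instead resolve the group against the list of declared group names.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: split "group_name.username" at the last dot.
// Returns {group, user}; the group part is empty when there is no dot.
std::pair<std::string, std::string> split_accounting_group(const std::string& value) {
    assert(!value.empty());
    std::string::size_type dot = value.rfind('.');
    if (dot == std::string::npos) {
        return std::make_pair(std::string(), value);  // no group given
    }
    return std::make_pair(value.substr(0, dot), value.substr(dot + 1));
}
```

Splitting at the last dot keeps a hierarchical group name such as "group_a.sub_b" intact when the value is "group_a.sub_b.bob".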

Quota Terminology

Note: The term "quota" is overloaded. Sometimes in the code and documentation, it means "the amount assigned by the administrator to a group" (entry->config_quota). It may also be the value translated from configured quota to actual (possibly weighted) slot quantity (entry->quota). The quantity finally assigned to a group, after quota computation and surplus sharing and fractional-quota distribution, is referred to as 'allocated' (entry->allocated).

Algorithm

First, the code builds up a data structure which describes each group: its position in the tree, the administratively configured quota, whether that quota is static or dynamic, and whether the group accepts surplus or participates in autoregroup. For each group, the current weighted usage is fetched from the accountant, as is the current userprio. The numbers of running and idle jobs are copied from each submitter's ad and summed into the corresponding group structure. Note that the number of running jobs also includes jobs running in flocked-to pools. Each group also contains a list of all the related submitter ads.

If autoregroup is on, the submitters are also appended to the root's list of submitter ads.
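The collection step described above might look roughly like the following sketch. The types `SubmitterAd` and `GroupDemand` and the function `collect_demand` are hypothetical stand-ins for the real ClassAd-based structures, and autoregroup is treated as a single flag for simplicity.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a submitter ad: owning group plus job counts.
struct SubmitterAd { std::string group; int running; int idle; };

// Hypothetical per-group summary built at the start of a negotiation cycle.
struct GroupDemand {
    int running = 0;
    int idle = 0;
    std::vector<const SubmitterAd*> submitters;  // related submitter ads
};

// Sum running/idle counts per group and record each group's submitter ads.
// With autoregroup on, every submitter is also appended to the root's list.
std::map<std::string, GroupDemand> collect_demand(const std::vector<SubmitterAd>& ads,
                                                  bool autoregroup,
                                                  const std::string& root) {
    std::map<std::string, GroupDemand> groups;
    for (const SubmitterAd& ad : ads) {
        assert(ad.running >= 0 && ad.idle >= 0);
        GroupDemand& g = groups[ad.group];
        g.running += ad.running;
        g.idle += ad.idle;
        g.submitters.push_back(&ad);
        if (autoregroup && ad.group != root) {
            groups[root].submitters.push_back(&ad);
        }
    }
    return groups;
}
```

The returned entries hold pointers into `ads`, so the input vector must outlive the result.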

After (weighted) slot quotas are assigned to all the group entries, surplus sharing is computed for all groups in the hierarchy configured to accept surplus. Following surplus sharing, when slot weighting is not enabled, any fractional quota allocations are consolidated and distributed in a round robin fashion.

Surplus Sharing

The basic principle for surplus sharing is: surplus quota is distributed among sibling groups in proportion to assigned quota. For example, if group A has twice the quota of group B, group A will be awarded twice the surplus. Some additional points:

- Available surplus consists of any surplus shared from the level above in the hierarchy, plus any surplus coming up from sibling sub-trees.
- Any groups with surplus sharing not enabled do not participate in surplus distribution.
- If a group does not need all of its potential surplus, any it does not use will be shared among the remaining participating groups.
- The parent group of siblings participates in sharing, effectively as another sibling.
- Any surplus unused after sharing among siblings (and parent) is sent up the hierarchy to be shared at the level above.
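The sharing rules above can be sketched for a single level of the hierarchy. This is an illustrative reading of the rules, not the negotiator's implementation: `Sibling` and `share_surplus` are hypothetical names, the parent is assumed to be passed in as one more `Sibling` entry, and "need" is modeled as `demand - allocated`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical view of one group at a single level of the hierarchy.
struct Sibling { double quota; double demand; bool accept_surplus; double allocated; };

// Distribute `surplus` among accepting siblings in proportion to quota.
// A sibling never receives more than its unmet demand; whatever it cannot
// use is re-shared among the remaining participants. Returns the surplus
// left unused, which would be sent up to the level above.
double share_surplus(std::vector<Sibling>& sibs, double surplus) {
    assert(surplus >= 0.0);
    const double eps = 1e-9;
    while (surplus > eps) {
        double total_quota = 0.0;
        for (const Sibling& s : sibs) {
            if (s.accept_surplus && s.demand - s.allocated > eps) total_quota += s.quota;
        }
        if (total_quota <= 0.0) break;  // nobody left who can use surplus
        double given = 0.0;
        for (Sibling& s : sibs) {
            double unmet = s.demand - s.allocated;
            if (!s.accept_surplus || unmet <= eps) continue;
            double share = std::min(surplus * s.quota / total_quota, unmet);
            s.allocated += share;
            given += share;
        }
        surplus -= given;
    }
    return surplus;
}
```

Each pass either hands out all remaining surplus or fully satisfies at least one participant, so the loop terminates after at most one pass per sibling plus one.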

Fractional Quota Consolidation

When slot weighting is not enabled, fractional quota values for groups are consolidated and distributed in round robin fashion to ensure that all quotas are integer values.

- Available remainder for consolidation consists of remainder coming from the upper level in the hierarchy, combined with any remainder coming up from sibling subtrees.
- Remainders are not accepted by groups not accepting surplus.
- Siblings having received remainder least recently are favored in round robin: siblings are ordered by time of last receipt of a remainder.
- Remainder unused at a level is sent up to the parent.
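A sketch of the round-robin consolidation for one level, under several assumptions: fractional parts of all siblings are pooled (including those of groups that do not accept surplus), each accepting sibling receives at most one whole slot per call, and "time of last receipt" is tracked as an integer round counter. `Entry` and `consolidate_fractions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-group state for fractional quota consolidation.
struct Entry { double allocated; bool accept_surplus; long last_remainder_round; };

// Floor every allocation, pool the fractional parts together with the
// remainder coming from elsewhere (`incoming`), then hand out whole slots to
// accepting siblings, least recently served first. Returns the remainder
// left over, which would be sent up to the parent.
double consolidate_fractions(std::vector<Entry>& sibs, double incoming, long round) {
    assert(incoming >= 0.0);
    double pool = incoming;
    for (Entry& e : sibs) {
        double whole = std::floor(e.allocated);
        pool += e.allocated - whole;
        e.allocated = whole;
    }
    std::vector<Entry*> order;
    for (Entry& e : sibs) {
        if (e.accept_surplus) order.push_back(&e);
    }
    std::stable_sort(order.begin(), order.end(), [](const Entry* a, const Entry* b) {
        return a->last_remainder_round < b->last_remainder_round;
    });
    for (Entry* e : order) {
        if (pool < 1.0 - 1e-9) break;  // not enough left for a whole slot
        e->allocated += 1.0;
        e->last_remainder_round = round;
        pool -= 1.0;
    }
    return pool;
}
```

After the call every sibling's `allocated` value is an integer, which is the stated goal of consolidation.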

The following steps are iterated GROUP_QUOTA_MAX_ALLOCATION_ROUNDS times (default: 3):

For each node in the tree, we assume that all running jobs associated with that group will stay running and that any idle jobs will match. If this sum is less than the quota, we declare the difference the surplus, and distribute the surplus bottom up to groups that are configured to accept surplus. ??? Then, we sort the submitters in "starvation order", by GROUP_SORT_EXPR, which defaults to the ratio of resources used to the quota (either weighted)?

Finally, we negotiate with each group in that order, with a quota limited as calculated above.
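The default "starvation order" can be sketched as a sort on the ratio of usage to quota, lowest ratio first. This assumes the entities being ordered are groups and that a group with zero quota sorts last; `Group`, `starvation_ratio`, and `sort_by_starvation` are hypothetical names, and the real ordering is whatever GROUP_SORT_EXPR evaluates to.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Hypothetical view of a group for ordering purposes.
struct Group { std::string name; double usage; double quota; };

// Ratio of resources used to quota. Groups without quota sort last
// (an assumption made for this sketch).
double starvation_ratio(const Group& g) {
    assert(g.quota >= 0.0);
    return g.quota > 0.0 ? g.usage / g.quota : std::numeric_limits<double>::max();
}

// Order groups so that the most starved (lowest usage/quota) comes first.
void sort_by_starvation(std::vector<Group>& groups) {
    std::stable_sort(groups.begin(), groups.end(), [](const Group& a, const Group& b) {
        return starvation_ratio(a) < starvation_ratio(b);
    });
}
```

Negotiation would then proceed through the vector front to back, each group limited by the quota computed earlier.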

Questions

- How common is it to have demand (submitters) in interior nodes?
- What about non-homogeneous pools?
- Is there a way to do this without relying on the submitter ad's number of idle/running jobs?
- How should this behave in the face of flocking? Weighted slots?