Group Quota Design
Motivating Scenarios
??? What are some good use cases here? What didn't the old code do that the new code can?
High Level Design and Definitions
The HGQ design is intended to allow an administrator to restrict the aggregate number of slots running jobs submitted by groups of users.
These sets of users are organized into hierarchical groups, with the "none" group being the name of the root. The admin is expected to assign a quota to every leaf and interior node in the tree, except for the root. An assigned quota can be an absolute number or a floating point number from 0 to 1, which represents a percentage of the immediate parent. If absolute, it represents a weighted number of slots, where each slot is multiplied by a configurable weight, which defaults to the number of cores. All groups named must be predeclared in the config file. Note that the quota is independent of user priority.
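As a concrete illustration, a configuration along these lines might express such a tree (the group names and values here are invented; the exact knob names should be checked against the HTCondor manual):

```
# Declare every group in the tree up front.
GROUP_NAMES = group_physics, group_physics.cms, group_physics.atlas, group_chem

# Absolute quotas are weighted slot counts.
GROUP_QUOTA_group_physics = 100
GROUP_QUOTA_group_chem = 50

# Dynamic quotas are fractions of the immediate parent.
GROUP_QUOTA_DYNAMIC_group_physics.cms = 0.6
GROUP_QUOTA_DYNAMIC_group_physics.atlas = 0.4

# Per-group surplus behavior; autoregroup left off pool-wide.
GROUP_ACCEPT_SURPLUS_group_chem = true
GROUP_AUTOREGROUP = false
```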
Definitions
Can we get crisp definitions of each of the fields in the GroupEntry structure?
Meaning of a "static" quota:
The static quota for a given group indicates the minimum number of machines that group is expected to be allocated, given sufficient demand. The sum of the static quotas of all the child nodes of a given parent must be less than or equal to the parent's static quota. The sum of the children's static quotas may be less than the parent's. If so, the extra will be apportioned only to the immediate children (?), regardless of the setting of accept_surplus (?) and autoregroup -- apportioned how? Proportionally to demand? By user_prio?
Meaning of a dynamic (proportional) quota:
A dynamic (proportional) quota indicates the percentage of the parent node's resources the group is expected to be allocated, given sufficient demand. If the children of a node have proportional quotas, each child is assigned an absolute quota computed from its proportion of the parent's absolute quota. The sum of all sibling quotas should be < 1.0.
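As a sketch of the arithmetic, a dynamic quota resolves to an absolute one by multiplying the child's fraction by the parent's absolute quota (the helper name and numbers below are invented for illustration, not taken from the real code):

```python
# Hypothetical helper: resolve children's proportional quotas against the
# parent's absolute quota. Sibling fractions should sum to at most 1.0.
def resolve_dynamic(parent_quota, fractions):
    return {name: parent_quota * frac for name, frac in fractions.items()}

# If group_physics holds 100 weighted slots and its children were given
# proportions 0.6 and 0.4:
children = resolve_dynamic(100, {"cms": 0.6, "atlas": 0.4})
# children == {"cms": 60.0, "atlas": 40.0}
```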
Each job then specifies which group it belongs to with the +AccountingGroup = "group_name.username" syntax in its submit file.
Note: the term "quota" is overloaded. Sometimes in the code and documentation it means "the amount assigned by the administrator to a group"; other times it means "the amount assigned by the administrator, plus the leftover slots surrendered by other groups".
Algorithm
First, the code builds up a data structure describing each group: its position in the tree, the administratively configured quota, whether that quota is static or dynamic, and whether the group has accept_surplus or autoregroup set. For each group, the current weighted usage is fetched from the accountant, as is the current userprio. The numbers of running and idle jobs are copied from each submitter's ad and summed into the corresponding group structure. Note that the number of running jobs also includes jobs running in flocked-to pools. Each group also holds a list of all its associated submitter ads.
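Pending crisp definitions of the real fields, here is a hedged sketch of the per-group bookkeeping the paragraph above implies. The actual GroupEntry is a C++ structure, and these Python field names are guesses for illustration, not the real member names:

```python
from dataclasses import dataclass, field

@dataclass
class GroupEntry:
    name: str
    parent: "GroupEntry | None" = None
    children: list = field(default_factory=list)
    config_quota: float = 0.0     # quota as configured by the admin
    static_quota: bool = True     # absolute (True) vs proportional (False)
    accept_surplus: bool = False  # may receive leftover slots from other groups
    autoregroup: bool = False     # submitters also negotiate at the root
    usage: float = 0.0            # weighted usage fetched from the accountant
    priority: float = 0.0         # current userprio
    num_running: int = 0          # includes jobs running in flocked-to pools
    num_idle: int = 0
    submitter_ads: list = field(default_factory=list)
```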
If autoregroup is on, the submitters are also appended to the root's list of submitter ads.
Then the code walks the tree bottom up, subtracting the sum of the child quotas from each interior node on the way up. Essentially, this is syntactic sugar to allow the admin to specify total quota values for intermediate nodes instead of marginal values.
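The bottom-up subtraction can be sketched as follows (the tree shape, helper name, and numbers are invented for illustration; note the subtraction must use the children's original admin-specified totals, not their already-reduced values):

```python
# Hypothetical sketch: convert admin-specified *total* quotas on interior
# nodes into *marginal* quotas by subtracting each node's children, bottom up.
def to_marginal(tree, quotas):
    """tree: {node: [children]}; quotas: {node: total quota}. Mutates quotas."""
    totals = dict(quotas)  # remember the admin-specified totals
    def walk(node):
        for child in tree.get(node, []):
            walk(child)
        quotas[node] -= sum(totals[c] for c in tree.get(node, []))
    walk("none")  # "none" is the name of the root group
    return quotas

tree = {"none": ["physics"], "physics": ["cms", "atlas"]}
quotas = to_marginal(tree, {"none": 200, "physics": 100, "cms": 60, "atlas": 40})
# physics keeps 100 - (60 + 40) = 0 marginal; the root keeps 200 - 100 = 100
```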
The following steps are iterated GROUP_QUOTA_MAX_ALLOCATION_ROUNDS times, which defaults to 3:
For each node in the graph, we assume that all running jobs associated with that group will stay running and that all idle jobs will match; if this sum is less than the quota, we declare the difference the surplus, and distribute the surplus bottom up to groups that are configured to accept surplus. ??? Then, we sort the submitters in "starvation order" by GROUP_SORT_EXPR, which defaults to the ratio of resources used to the quota (both weighted?).
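Under that reading, one round's per-group surplus computation would look roughly like this (the helper name is invented, and the real code goes on to distribute the surplus through the tree rather than just computing it per group):

```python
# Sketch: a group's demand is its running jobs (assumed to keep running)
# plus its idle jobs (assumed to match). Any quota not covered by demand
# is surplus, available to groups with accept_surplus set.
def surplus_of(quota, running, idle):
    demand = running + idle
    return max(quota - demand, 0)

surplus_of(50, 30, 10)  # demand 40 < quota 50 -> 10 slots of surplus
surplus_of(50, 60, 10)  # demand exceeds quota -> no surplus
```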
Finally, we negotiate with each group in that order, with each group's quota limited as calculated above.
Questions
How common is it to have demand (submitters) in interior nodes? What about non-homogeneous pools? Is there a way to do this without relying on the submitter ads' counts of idle/running jobs? How should this behave in the face of flocking? Of weighted slots?