Ticket #2330: support pool defragmentation
Step 1: Add basic mechanisms for draining partitionable slots.
Step 2: Create a central daemon that implements a simple but useful pool defragmentation policy.
- [DONE] 2011-07-29 - Initial draft of plan with details of protocols, policies, and metrics.
- [DONE] 2011-08-12 - Initial draft of development plan (i.e. steps needing to be done and a time estimate).
- [DONE] 2011-08-26 Revision 2 of startd drain plan addressing issues in the remark below from tannenba aug-24-2011.
- [DONE] 2011-09-09 Revision 3 of startd drain plan to match Miron's requirements
- [DONE] 2011-09-20 Revision 4 of startd drain plan. The upper bound on eviction time is now explicit. Also, eviction of jobs is delayed in a coordinated way in order to reduce idle time.
- [DONE] 2011-09-22 Make a development plan for startd draining.
- [DONE] 2011-09-29 Add new draining commands to the startd and create a command-line tool that calls them. The commands are no-ops for now.
- [DONE] Make upper bound on time spent in Preempting/Vacate explicit. See #2536.
- [DONE] 2011-10-27 Implement transition to draining and cancellation of draining. No new jobs accepted (i.e. START=False), but no eviction yet.
- [DONE] 2011-10-31 Implement eviction. The graceful draining case must coordinate eviction to reduce idle time.
- [DONE] 2011-11-03 Implement Drained state.
- [DONE] 2011-11-03 Add support for auto-resume after defragmentation.
- [DONE] 2011-11-04 Add startd attributes that advertise draining badput and idle time and estimates of time to completion.
- [DONE] 2011-11-07 Add an automated test of draining.
- Draining mechanism is code complete. Checked in for 7.7.5.
- [DONE] 2011-11-09 Finish updating documentation.
- [DONE] 2011-11-15 Finish implementing and documenting condor_defrag.
2011-Aug-24 12:18:35 by tannenba:
For the record, here's what I think we concluded today in our chat on draining.
During draining, we still need most of the existing state machine, so it is awkward to introduce "draining" as an activity or a state. Therefore, we propose to add a 3rd independent state variable that indicates whether the startd is draining or not. It could be published as a boolean attribute named "Draining".
We talked about the semantics of the draining deadline. In the existing document, this was proposed to be the time after which claims transition to the retiring activity (i.e. the normal process by which the startd closes a claim). This means draining could take an arbitrary amount of time past the deadline, depending on the startd preemption policy. Not good.
We agreed that it would be more useful for the draining deadline to be the time at which we will finish vacating jobs. To accomplish this, the startd must advertise the earliest draining deadline that its policy will permit, and it will reject requests with shorter deadlines. The earliest possible draining deadline depends on MaxJobRetirementTime. I propose that the startd advertise CurrentMaxJobRetirementTime, just like it advertises CurrentRank. Once the startd has a draining deadline, it must modify its advertised MaxJobRetirementTime expression so that it never promises more retirement time than what remains before the deadline. This will prevent backfill jobs that require more retirement time from matching.
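The deadline rule proposed above can be sketched as follows. This only illustrates the arithmetic and is not startd code: the function names and the plain-seconds representation are assumptions made for the example, and in the real system MaxJobRetirementTime is a ClassAd expression rather than a number.

```python
def earliest_draining_deadline(now, remaining_retirement):
    """Earliest deadline the policy permits: the moment the longest
    outstanding retirement promise runs out (hypothetical helper)."""
    return now + max(remaining_retirement, default=0)


def accept_drain_request(now, requested_deadline, remaining_retirement):
    """Reject any request whose deadline is earlier than the earliest
    deadline that the current retirement promises allow."""
    return requested_deadline >= earliest_draining_deadline(now, remaining_retirement)


def advertised_retirement_time(now, deadline, configured_retirement):
    """Once a deadline is set, never promise more retirement time
    than what remains before it (assumed to floor at zero)."""
    return max(0, min(configured_retirement, deadline - now))
```

With two running claims that are still owed 300 and 120 seconds, a request arriving at time 1000 would have to name a deadline of at least 1300. After that deadline is accepted, a job that wants 600 seconds of retirement would be offered only 300, which is what keeps longer-running backfill jobs from matching.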
Currently, once MaxJobRetirementTime expires, the Preempting state is entered. How long preemption takes depends on WANT_VACATE and KILL. Again, this could be an arbitrary amount of time, so this could conflict with the draining deadline. The idea we talked about for addressing this was to change the semantics of MaxJobRetirementTime. Instead of this being the time at which preemption begins, it should be the time at which hard-killing begins.
2011-Aug-24 12:19:31 by danb:
I thought a bit more about this. I am concerned, because the proposed change provides a way to promise the job a certain total runtime, but it does not provide a way to promise a certain amount of vacate time. Perhaps we need both MaxJobRetirementTime and MaxJobVacateTime.
With MaxJobRetirementTime, the job can say that it wants less than what the machine is willing to offer (never mind about incentives for now). The same could be true for MaxJobVacateTime: e.g. if the job's SoftKillTime is less than the machine's MaxJobVacateTime. (If we followed the same convention as MaxJobRetirementTime, then instead of naming the job attribute SoftKillTime, it would be named the same as the machine attribute, MaxJobVacateTime.)
So under this proposal, I think we would get rid of WANT_VACATE and KILL. The function of these would be taken over by MaxJobVacateTime. We may not need/want to change the semantics of MaxJobRetirementTime after all. Still need to think through some of the details.
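A minimal sketch of how a job-side limit could combine with the machine-side limit under this proposal. The rule shown (the job may ask for less than the machine offers, but never more) is inferred from the remark above; the function name and the use of plain seconds are assumptions for illustration only.

```python
def effective_limit(machine_limit, job_limit=None):
    """Seconds actually granted: the machine's offer, reduced if the job
    asks for less. A job that states no limit gets the machine's offer."""
    if job_limit is None:
        return machine_limit
    return min(machine_limit, job_limit)
```

The same helper would apply to both retirement time and vacate time, since the remark proposes that the two follow the same convention.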
2011-Sep-02 11:06:47 by danb:
After further discussion, Todd and I were less comfortable with revamping the preempting state in order to be able to predict an upper bound on how long it will take. We agreed to treat eviction time as negligible for now. So the draining deadline is the time at which eviction begins. In the future, this could be revisited.
I should also note that after our discussion, I realized that job suspension makes time of completion of MaxJobRetirementTime unknown in advance. I therefore adopted the stance that there should be a "graceful" deadline semantic that automatically extends to meet MaxJobRetirementTime promises. It seemed convenient to me for this behavior to apply both at the time of the request and later when/if job suspension takes place. Stricter deadline semantics could be added, but I thought we should at least start with "graceful".
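The "graceful" semantic described above can be sketched as a deadline that is recomputed whenever claims change and that only ever moves later. The model below is an assumption made for illustration: it treats retirement time as a countdown that pauses while a job is suspended, and the `Claim` class and function names are hypothetical, not taken from the startd.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    retirement_left: int      # seconds of promised runtime not yet used
    suspended: bool = False   # a suspended job does not use up its promise


def graceful_deadline(now, claims, previous_deadline=None):
    """Deadline that honors every retirement promise. It is extended,
    never shortened, when suspension delays a job."""
    needed = now + max((c.retirement_left for c in claims), default=0)
    if previous_deadline is None:
        return needed
    return max(previous_deadline, needed)


def advance(claims, seconds):
    """Let time pass: only running jobs consume retirement time."""
    for c in claims:
        if not c.suspended:
            c.retirement_left = max(0, c.retirement_left - seconds)
```

For example, with claims owed 300 and 100 seconds at time 1000, the initial deadline is 1300. If the first job is then suspended for 200 seconds, recomputing at time 1200 gives 1500, because that job is still owed its full 300 seconds.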
2011-Sep-06 18:20:48 by danb:
Todd, Greg, and I (Dan) discussed the 2nd draft of the draining plan today.
We concluded that the proposed backfill toggle should be replaced with two things:
- a toggle that says whether early termination of draining should be accepted or not (i.e. what to do when the slots all happen to be idle). For the defragmentation daemon, we would expect early termination to be a good thing, whereas in the case of a coordinated pool-wide maintenance time, early termination might not be desired.
- a capability to specify arbitrary DrainingAttr attributes in the draining request
2011-Sep-12 10:31:11 by danb:
Miron wanted a different design. A new version of the draining design doc has been attached that attempts to achieve this.
- no backfill
- only whole startd draining
- always drain as soon as possible (but provide option to honor retirement time)
- to deal with race conditions caused by stale information, have the startd block while the draining requester examines up-to-date information, rather than having the requester send a "staleness tolerance expression" in the request
- provide an explicit boolean option to prevent suspension from extending retirement indefinitely
Ideas to consider for the future:
- support draining subsets of the machine by having the client request how much it wants and leaving it up to the startd to choose the most effective way to satisfy the request
- support trivial backfill jobs that are preempted as soon as needed to finish draining (i.e. as soon as retirement promises to non-backfill jobs have completed). Never delay completion of draining because of backfill jobs.
2011-Sep-12 14:48:28 by danb:
Another discussion with Miron:
- must have an upper bound on the eviction time that is known in advance
- distinguish between checkpointable jobs and jobs that just wish to transfer back the sandbox; if a job is checkpointable, it should be evicted (and checkpointed) right away without waiting for retirement time
- Q from Dan: is the existing facility of the job opting out of retirement time sufficient to satisfy this point?
- information that applies to the whole machine should be published in a machine ad
- the partitionable slot ad may be acceptable to satisfy this point
Type: enhance
Last Change: 2011-Nov-21 16:12
Status: resolved
Created: 2011-Jul-22 17:35
Fixed Version: v070705
Broken Version: v070000
Priority: 2
Subsystem:
Assigned To: danb
Derived From:
Creator: danb
Rust:
Customer Group: cms
Visibility: public
Notify: email@example.com,firstname.lastname@example.org,email@example.com,firstname.lastname@example.org,email@example.com
Due Date:
|#2536||make upper bound on time spent in Preempting/Vacate explicit|
|2011-Dec-28 10:46||Check-in : last details for documenting condor_defrag, including some index entry changes ===GT=== #2330 (By Karen Miller )|
|2011-Dec-27 14:24||Check-in : A lot of minor edits and LaTeX changes to documentation about the new Drained state and the condor_defrag daemon (and associated ClassAd attributes and configuration variables). ===GT=== #2330 (By Karen Miller )|
|2011-Nov-30 13:56||Check-in : Fixed typos in defrag documentation. #2330 (By Dan Bradley )|
|2011-Nov-21 16:09||Check-in : editing changes associated with new condor_defrag work ===GT=== #2330 (By Karen Miller )|
|2011-Nov-21 14:58||Check-in : Added missing defragad.tex #2330 (By Dan Bradley )|
|2011-Nov-16 16:34||Check-in : Documented state and activity transitions relating to draining. #2330 (By Dan Bradley )|
|2011-Nov-16 14:34||Check-in : new diagram for machine activity/state transitions, given new Drained state ===GT=== #2330 (By Karen Miller )|
|2011-Nov-15 18:24||Check-in : Documented LastDrainStartTime. #2330 (By Dan Bradley )|
|2011-Nov-15 18:24||Check-in : Documented effect of suspension during draining on completion estimate. #2330 (By Dan Bradley )|
|2011-Nov-15 18:24||Check-in : Added LastDrainStartTime. #2330 (By Dan Bradley )|
|2011-Nov-15 18:05||Check-in : Documented state transitions related to the new Drained state. #2330 (By Dan Bradley )|
|2011-Nov-15 17:51||Check-in : Documented condor_defrag. #2330 Added section to give high-level view and configuration examples. Added section describing attributes in the daemon ad. (By Dan Bradley )|
|2011-Nov-15 13:54||Check-in : Added daemon ad for condor_defrag. #2330 Added some draining-related stats to this daemon ad. (By Dan Bradley )|
|2011-Nov-14 12:22||Check-in : Document updated condor_defrag defaults. #2330 (By Dan Bradley )|
|2011-Nov-14 12:22||Check-in : Filter out offline ads in condor_defrag. #2330 (By Dan Bradley )|
|2011-Nov-14 12:02||Check-in : forgot to commit .dia format version of diagram. ===GT=== #2330 (By Karen Miller )|
|2011-Nov-14 11:55||Check-in : Redraw new states diagram for figure 3.3. Remove old, and useless versions of the diagram. ===GT=== #2330 (By Karen Miller )|
|2011-Nov-11 17:26||Check-in : Documented condor_defrag. #2330 (By Dan Bradley )|
|2011-Nov-11 17:26||Check-in : Improved defaults and fixed a bug in condor_defrag. #2330 (By Dan Bradley )|
|2011-Nov-11 10:06||Check-in : Improve comments in defrag test code to make it easier to understand. #2330 (By Dan Bradley )|
|2011-Nov-11 09:54||Check-in : Fix defrag test for windows. #2330 It is simpler to use multiple personal condors than to safely pack multiple startds into one personal condor. (By Dan Bradley )|
|2011-Nov-10 19:42||Check-in : Added condor_defrag. #2330 (By Dan Bradley )|
|2011-Nov-10 19:42||Check-in : Added test of condor_defrag. #2330 (By Dan Bradley )|
|2011-Nov-09 11:11||Check-in : Documented new draining-related machine ad attributes. #2330 (By Dan Bradley )|
|2011-Nov-09 10:47||Check-in : Improvement to condor_drain docs. #2330 (By Dan Bradley )|
|2011-Nov-08 17:57||Check-in : Documented condor_drain. ===GT=== #2330 ===VersionHistory:Complete=== (By Dan Bradley )|
|2011-Nov-08 12:14||Check-in : fix typo ===GT=== #2330 (By Karen Miller )|
|2011-Nov-08 10:40||Check-in : Added cmd_drain test to list of windows tests. #2330 (By Dan Bradley )|
|2011-Nov-07 16:54||Check-in : Added test of condor_drain. #2330 (By Dan Bradley )|
|2011-Nov-07 16:54||Check-in : Documented Drained state. #2330 (By Dan Bradley )|
|2011-Nov-07 16:54||Check-in : Added support for draining the startd. #2330 (By Dan Bradley )|
|2011-Oct-17 19:00||Check-in : Added MachineMaxVacateTime and JobMaxVacateTime and WantGracefulRemoval. #2536 (parent ticket #2330) [...] (By Dan Bradley )|
- HTPC_7_7_Step_1_DesignDocument.docx 139664 bytes added by danb on 2011-Sep-08 20:17:26 UTC.
- DrainingDesignDoc.pdf 78771 bytes added by danb on 2011-Sep-22 20:08:43 UTC.
This replaces the plan put forward in HTPC_7_7_Step_1_DesignDocument. The main difference is that there is no backfill and no "delayed" draining in the new plan.
- DefragDaemonDesignDoc.docx 135826 bytes added by danb on 2011-Nov-15 23:56:14 UTC.