Ticket #2802: schedd not always reusing claims when it should

Found a bug in the schedd that causes it to release claims when it still has jobs it could run on those claims. The problem arises when a job no longer matches a slot after it completes; this is not all that uncommon when using our defaults w/ dynamic slots.

The story is:

  1. a job id x completes.
  2. schedd goes through the priorec array to try to find another job that matches the claimed resource. UNFORTUNATELY, it may happen that (a) the priorec array has not yet been rebuilt (i.e. a cached priorec array is in use), and (b) the job classad for the completed job id x has not yet been destroyed, because it is still waiting in the enqueueFinishedJob queue
  3. as a result of (a) and (b), FindRunnableJob() will try to match the very job that just completed with the now idle claim
  4. the job that just completed may no longer match the claimed startd ad, especially when submitting jobs w/ the defaults and using dynamic slots. For instance, machine.Memory < jobad.ImageSize, because ImageSize in the completed job is bigger than when the resource was initially claimed (once the job has completed, ImageSize reflects the real size seen by the starter and not just the size of the executable on disk).
  5. FindRunnableJob() now marks the entire autocluster as not matching this claim
  6. the claim is relinquished when Todd's test really expected it should be reused :(
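
Step 4 can be illustrated with a small Python sketch. This is not HTCondor code: the dictionaries stand in for ClassAds, `matches` is a hypothetical helper that models only the memory comparison quoted above, the numbers are made up, and units are ignored for simplicity.

```python
# Toy model of step 4. Dicts stand in for ClassAds; this is not HTCondor's API.

def matches(slot, job):
    """Hypothetical stand-in for the memory part of the slot/job match."""
    return slot["Memory"] >= job["ImageSize"]

# At submit time ImageSize is only the size of the executable on disk.
job = {"ImageSize": 10}              # made-up value

# A dynamic slot is carved out to fit the job as it looked when claimed.
slot = {"Memory": job["ImageSize"]}

matched_at_claim = matches(slot, job)

# After the run, ImageSize reflects what the starter actually observed.
job["ImageSize"] = 500               # made-up value
matched_after_completion = matches(slot, job)

print(matched_at_claim, matched_after_completion)
```

The slot itself never changed; only the completed job's ad did, which is why the stale job ad fails a match that succeeded when the claim was made.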

Remarks:

2012-Feb-01 11:01:32 by tannenba:
Pushed patch [30063] that fixes the problem by changing the order of our checks in schedd FindRunnableJob(). Previously, this method first tested that the job and machine ad still matched, then tested that the job is still runnable. I reversed the order - first test that the job is still runnable, then test that it matches. This prevents FindRunnableJob() from considering a job that just completed (step 3 above), which is what got us into trouble in the first place.
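
A rough Python sketch of why the order of the two checks matters. It is a toy model under stated assumptions, not the schedd's implementation: `find_runnable_job`, the `runnable_first` flag, the dict fields, and the sample values are all hypothetical, and "runnable" is reduced to a simple status comparison.

```python
# Toy model of the check order in FindRunnableJob(); not the schedd's code.

def find_runnable_job(prio_rec, slot, runnable_first):
    """Return (job id or None, autoclusters marked as not matching the claim)."""
    rejected = set()
    for job in prio_rec:
        if job["autocluster"] in rejected:
            continue  # the whole autocluster was already ruled out
        is_runnable = job["status"] == "idle"
        is_match = slot["Memory"] >= job["ImageSize"]
        if runnable_first:
            # New order: a job that is not runnable is skipped before matching.
            if not is_runnable:
                continue
            if not is_match:
                rejected.add(job["autocluster"])
                continue
        else:
            # Old order: a failed match rules out the autocluster right away.
            if not is_match:
                rejected.add(job["autocluster"])
                continue
            if not is_runnable:
                continue
        return job["id"], rejected
    return None, rejected

slot = {"Memory": 10}
# Cached priority records: completed job x is still present, with its
# ImageSize already updated to the (larger) size observed during the run.
prio_rec = [
    {"id": "x", "autocluster": 1, "status": "completed", "ImageSize": 500},
    {"id": "y", "autocluster": 1, "status": "idle", "ImageSize": 10},
]

old_order = find_runnable_job(prio_rec, slot, runnable_first=False)
new_order = find_runnable_job(prio_rec, slot, runnable_first=True)
print(old_order)
print(new_order)
```

With the old order, the stale ad for job x fails the match and takes its autocluster down with it, so idle job y is never considered and no job is found. With the new order, job x is skipped as not runnable and job y is selected.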


2012-Feb-01 13:09:05 by matt:
This is unlikely to impact anyone who is using RequestMemory. When RequestMemory is not set, the default is very small, almost guaranteed to be overrun and trip this issue.


2012-Feb-03 15:45:53 by tstclair:
So in testing this at length, I've found there appear to be numerous times where we release a claim that could have been matched. I'm still tracing the logic to identify the cases.

I found that we never re-use a claim until it times out. I'm reopening as a result.


2012-Feb-06 14:41:45 by tstclair:
I'll reopen once I can find a consistent repro method; moving on for now.

Properties:

Type:           defect                                   Last Change:    2012-Mar-05 21:34
Status:         resolved                                 Created:        2012-Feb-01 10:31
Fixed Version:  v070705                                  Broken Version: v070600
Priority:                                                Subsystem:      Daemons
Assigned To:    tannenba                                 Derived From:
Creator:        tannenba                                 Rust:
Customer Group: osg                                      Visibility:     public
Notify:         matt@cs.wisc.edu, tstclair@redhat.com    Due Date:

Related Check-ins:

2012-Mar-27 10:14   Check-in [31007]: complete version history edit. ===GT=== #2802 (By Karen Miller)
2012-Mar-26 15:38   Check-in [31004]: partial edit of version history item. ===GT=== #2802 (By Karen Miller)
2012-Mar-05 21:33   Check-in [30818]: Document #2802 (By Greg Thain)
2012-Feb-01 10:44   Check-in [30063]: fixed bug where schedd not always reusing claims when it should. #2802 (By Todd Tannenbaum)