Ticket #2706: Schedd should kill claim when lease expires

When the claim lease between the schedd and startd expires, the startd kills the job and leaves the claimed state. The schedd doesn't do anything. In some cases, the shadow will notice that its connection to the starter has closed and exit with an error code. But if the startd machine drops off the network without a trace, the shadow may not notice the dead connection for 2 hours. To avoid this situation, the schedd should check for expired claim leases and abort the claim (and the shadow) when it sees an expired lease.
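
A minimal sketch of the proposed check, written in Python for brevity rather than in the schedd's own C++. The names `Claim`, `last_alive`, `lease_duration`, and `expired_claims` are hypothetical and not taken from the HTCondor source; the sketch assumes the schedd records when it last received a keepalive from each startd.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Claim:
    """Hypothetical stand-in for the schedd's record of one claim."""
    claim_id: str
    last_alive: float      # time the last keepalive arrived from the startd
    lease_duration: float  # seconds the lease stays valid after a keepalive


def lease_expired(claim: Claim, now: float) -> bool:
    """True when no keepalive has arrived within the lease duration."""
    return now - claim.last_alive > claim.lease_duration


def expired_claims(claims: Iterable[Claim], now: float) -> List[Claim]:
    """Return the claims the schedd should abort, along with their shadows."""
    return [c for c in claims if lease_expired(c, now)]
```

With a check like this run periodically, a silent startd would be noticed after roughly one lease duration instead of waiting for the shadow's connection to time out.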

Note: Having the schedd detect expired claim leases depends on the new startd-sends-alives protocol. Since this has been the default for a while and we want to remove the old schedd-sends-alives protocol, this dependency shouldn't be a problem.

Remarks:

2011-Dec-13 06:26:38 by matt:
Is this a bug in the Schedd's handling of STARTD_SENDS_ALIVES=TRUE?

When do you plan to remove STARTD_SENDS_ALIVES=FALSE?


2011-Dec-13 10:06:10 by jfrey:
I'd say it's a bug in STARTD_SENDS_ALIVES=TRUE. We want to put the fix into 7.6 if it's simple, which I believe it will be.

As for removing STARTD_SENDS_ALIVES=FALSE, I think we'll need to wait for 7.9. It was only in 7.5.4 that we changed the default from False to True and had the schedd alone decide which value to use.


2011-Dec-13 12:56:30 by jfrey:
I've attached a patch that makes the schedd relinquish a claim and kill the shadow if the claim lease expires. It works for claims that are running non-parallel jobs.

The patch also fixes a bug when the schedd has a mixture of startd-sends-alive and schedd-sends-alive jobs.


2011-Dec-15 14:05:28 by tannenba:
CODE REVIEW

Looks good, just some minor changes -

  1. add comment at inserted time(NULL) to explain it is to deal w/ mixed startd and schedd send alive pools.
  2. get time(NULL) out of the while loop; can just use "now"
  3. remove extraneous lease_duration > 0 check in last conditional, it is gratuitous
  4. make sure srec isn't NULL before de-refing it
  5. don't try to send vacate to the startd - the lease expired on that side as well, so it isn't needed, and furthermore the schedd probably cannot talk to the startd now anyhow, so it is just a waste of precious bodily fluids... i mean schedd resources...
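
For illustration only, the review points above could look roughly like this in a Python sketch. This is not the committed patch: `MatchRecord`, `ShadowRecord`, `reap_expired_leases`, and the `kill_shadow` callback are hypothetical names, and the real schedd code is C++.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ShadowRecord:
    """Hypothetical record of a running shadow process."""
    pid: int


@dataclass
class MatchRecord:
    """Hypothetical record of one claim held by the schedd."""
    claim_id: str
    last_alive: float
    lease_duration: float
    shadow: Optional[ShadowRecord] = None


def reap_expired_leases(matches: Dict[str, MatchRecord],
                        kill_shadow: Callable[[int], None],
                        now: Optional[float] = None) -> List[str]:
    """Drop every claim whose lease has expired; return the released ids."""
    if now is None:
        now = time.time()  # read the clock once, outside the loop (item 2)
    released = []
    for claim_id, rec in list(matches.items()):
        if now - rec.last_alive <= rec.lease_duration:
            continue  # lease still valid
        if rec.shadow is not None:  # guard before dereferencing (item 4)
            kill_shadow(rec.shadow.pid)
        # Release locally only; no vacate is sent to the startd (item 5).
        del matches[claim_id]
        released.append(claim_id)
    return released
```

Passing `now` in explicitly also makes the loop easy to exercise without touching the real clock.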


2011-Dec-15 15:12:04 by jfrey:
Patch committed and pushed with Todd's recommended changes.

Properties:

Type:            defect      Last Change:     2011-Dec-15 15:12
Status:          resolved    Created:         2011-Dec-12 17:24
Fixed Version:   v070605     Broken Version:  v070600
Priority:                    Subsystem:       Daemons
Assigned To:     jfrey       Derived From:
Creator:         jfrey       Rust:
Customer Group:  other       Visibility:      public
Notify:          matt@cs.wisc.edu tannenba@cs.wisc.edu
Due Date:

Derived Tickets:

#2736   schedd tries to kill pid 0 and then excepts

Related Check-ins:

2011-Dec-19 10:56   Check-in [28843]: minor edit of version history item ===GT=== #2706 (By Karen Miller )
2011-Dec-15 15:10   Check-in [28824]: Document schedd kill claims on lease expiration. #2706 ===VersionHistory:Complete=== (By Jaime Frey )
2011-Dec-15 15:03   Check-in [28823]: Schedd now kills claims when the lease expires. #2706 With the newer startd-sends-alives protocol for the claim lease, the schedd knows when the lease has expired, but before now, it didn't do anything. If the execute machine drops off the network, the schedd can end up waiting up to two hours before [...] (By Jaime Frey )

Attachments: