Ticket #2317: Parallel Universe jobs get preempted upon renewal of job lease

The telling lines from the startd log are:

07/12/11 21:08:54 Calling Handler <Alive to schedd <>> (4)
07/12/11 21:08:54 slot1: State change: claim no longer recognized by the schedd - removing claim
07/12/11 21:08:54 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing

Condor's keep-alive protocol lets a startd notice when the schedd holding its claim disappears, and vice versa. This protocol changed between v7.4.x and v7.6.x. In v7.4.x, the schedd sent alive packets to the startds via UDP. In v7.6.x, to improve scalability and reliability and to be friendlier to firewalls, the startd instead sends alive data to the schedd over a round-trip TCP connection. Unfortunately, it looks like the new alive protocol and parallel universe jobs do not play together nicely: as the log lines above show, the schedd answers the startd's alive message as if it no longer recognizes the claim, so the startd removes the claim and kills the job.
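To make the failure mode in the log concrete, here is a toy model of a startd-initiated alive round trip. It is only a sketch, not Condor's code: the `Schedd` and `Slot` classes, the `known_claims` set, and the method names are all hypothetical, and a plain function call stands in for the TCP round trip. It does not attempt to model why the schedd fails to recognize a parallel universe claim, only what the startd does once it gets that answer.

```python
from dataclasses import dataclass, field


@dataclass
class Schedd:
    # Hypothetical: claim ids this schedd believes it currently holds.
    known_claims: set = field(default_factory=set)

    def handle_alive(self, claim_id: str) -> bool:
        """Answer a startd's alive message: True if the claim is recognized."""
        return claim_id in self.known_claims


@dataclass
class Slot:
    name: str
    claim_id: str
    state: str = "Claimed"
    activity: str = "Busy"

    def send_alive(self, schedd: Schedd) -> None:
        """Startd-initiated alive; the direct call stands in for the TCP round trip."""
        if not schedd.handle_alive(self.claim_id):
            # Mirrors the logged transition Claimed/Busy -> Preempting/Killing.
            self.state, self.activity = "Preempting", "Killing"


schedd = Schedd(known_claims={"claim-A"})
healthy = Slot("slot1", "claim-A")
orphaned = Slot("slot2", "claim-B")  # a claim the schedd does not recognize

healthy.send_alive(schedd)
orphaned.send_alive(schedd)
```

In this model the slot whose claim the schedd knows stays Claimed/Busy, while the slot whose claim is missing from the schedd's table moves to Preempting/Killing, which is the transition seen for slot1 in the startd log.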

We have an immediate work-around available - we can explicitly tell the schedd to use the old-style protocol by putting the following in the condor_config file seen by all schedds that will be submitting parallel universe jobs:


This simple config entry should hold the fort until we get this bug fixed.

Type:            defect              Last Change:     2011-Aug-22 12:27
Status:          resolved            Created:         2011-Jul-19 12:37
Fixed Version:   v070603             Broken Version:  v070600
Priority:                            Subsystem:       Parallel
Assigned To:     gthain              Derived From:
Creator:         tannenba            Rust:
Customer Group:  ligo                Visibility:      public
Notify:          tstclair@redhat.com, tannenba@cs.wisc.edu
Due Date:        20110816

Related Check-ins:

2011-Aug-17 16:59   Check-in [26836]: Version history entry for parallel universe job lease bug. #2317 (By Todd Tannenbaum)
2011-Aug-17 16:51   Check-in [26835]: Fix bug where parallel jobs got preempted on job lease renewal. ===GT=== #2317 ===VersionHistory:Complete=== (By Todd Tannenbaum)
2011-Jul-19 12:40   Check-in [26073]: known bug and work around version history item added: parallel univ jobs preempted as job lease expires at 20 minutes ===GT=== #2317 (By Karen Miller)