Ticket #2563: bug in startd alive protocol when there are communication failures

Communication failure during the startd->schedd alive protocol can cause the startd to kill the job, because it incorrectly determines that the claim is no longer recognized by the schedd. I have observed this actually happen to jobs in the Wisconsin CMS T2.

startd log:

10/19/11 12:07:20 Calling Handler <Alive to schedd <128.104.29.235:54791>> (3)
10/19/11 12:07:20 slot1: Received Alive response of -1 from schedd <128.104.29.235:54791> job 1.0
10/19/11 12:07:20 slot1: State change: claim no longer recognized by the schedd - removing claim

schedd log, showing a timeout when reading a message from the startd:

10/19/11 12:07:15 (pid:93969) Received TCP command 441 (ALIVE) from unauthenticated@unmapped <128.104.29.235:55161>, access level READ
10/19/11 12:07:15 (pid:93969) Calling HandleReq <receive_startd_alive> (0) for command 441 (ALIVE) from unauthenticated@unmapped <128.104.29.
235:55161>
10/19/11 12:07:16 (pid:93969) condor_read(): timeout reading 5 bytes from <128.104.29.235:55161>.
10/19/11 12:07:16 (pid:93969) IO: Failed to read packet header
10/19/11 12:07:16 (pid:93969) Received startd keepalive for unknown claimid #...
10/19/11 12:07:16 (pid:93969) Return from HandleReq <receive_startd_alive> (handler: 1.000s, sec: 0.001s, payload: 0.000s)

The schedd command handler receive_startd_alive is not correctly detecting the error. It is going ahead and processing the command with claim_id=NULL. It therefore sends back a negative response to the startd, indicating that the claim is no longer recognized by the schedd.

How can this be? Why does the following code not detect the communication failure?

    s->get_secret(claim_id);    // must free this; CEDAR will malloc cuz claimid=NULL
    if ( !s->end_of_message() ) {
        if (claim_id) free(claim_id);
        return FALSE;
    }

The answer is that end_of_message() is returning true, because of the following code in end_of_message():

            if ( allow_empty_message_flag ) {
                allow_empty_message_flag = FALSE;
                return TRUE;
            }

allow_empty_message_flag hides the error! It basically makes the first end_of_message() after receipt of a command an unreliable indicator of success.

A fix for this specific case would be to check the return value of s->get_secret(). I am making that change for 7.6.

But can we do better? I think so. If any attempt to read from the socket automatically cleared the allow_empty_message_flag, then the schedd code would have correctly detected the error. I think this change should be made in 7.7, but, arguably, it could go into 7.6 instead.

Another issue raised by this incident is that the receive_startd_alive command handler is getting called without doing a non-blocking wait for the command payload to arrive. In 7.7, commands can be registered so that DaemonCore does a non-blocking wait, so in 7.7, I will use that mechanism for this command. In addition, we should probably not set allow_empty_message_flag for commands that are registered in this way, because we know that they expect to receive data.

[Append remarks]

Remarks:

2011-Oct-19 14:59:31 by danb:
I made the very specific fix in 7.6.4: [27897]. The more general fixes have all been done in the master branch.


2011-Oct-19 15:40:01 by bbockelm:
Reported the same issue here for the Nebraska CMS T2 for glideinWMS jobs:

https://savannah.cern.ch/support/?124115

(ignore the CCB issue, look lower in the ticket)


2012-Mar-13 15:29:44 by tannenba:
CODE REIVEW

Looks good, thanks for the patch. Resolving.

[Append remarks]

Properties:

Type: defect           Last Change: 2012-Mar-13 15:31
Status: resolved          Created: 2011-Oct-19 13:20
Fixed Version: v070604           Broken Version: v070600 
Priority:          Subsystem:  
Assigned To: tannenba           Derived From:  
Creator: danb  Rust:  
Customer Group: cms  Visibility: public 
Notify: dan@hep.wisc.edu,tannenba@cs.wisc.edu,gthain@cs.wisc.edu  Due Date:  

Related Check-ins:

2011-Oct-19 14:53   Check-in [27902]: Documented fix for startd->schedd alive protocol bug. #2563 ===GT=== #2563 ===VersionHistory:Complete=== (By Dan Bradley )
2011-Oct-19 14:32   Check-in [27897]: Detect failure to receive claim_id in startd->schedd alive protocol. #2563 The problem is that end_of_message() alone is not a reliable indicator of success, due to allow_empty_message_flag. (By Dan Bradley )
2011-Oct-19 14:27   Check-in [27895]: Clear allow_empty_message_flag when there is an attempt to read data. #2563 Also, get rid of muddy semantics of allow_empty_message_flag. Previously, it would both allow a missing message and ignore trailing junk in a message. (By Dan Bradley )
2011-Oct-19 14:27   Check-in [27896]: Register schedd commands with non-blocking wait for command message. #2563 (By Dan Bradley )
2011-Oct-19 14:27   Check-in [27894]: Avoid setting allow_empty_message_flag when we know the command has a payload. #2563 (By Dan Bradley )