Ticket #2540: Parallel Universe Job isn't being matched

The problem

LIGO has a problem concerning parallel universe.

The problem presents thusly:

WHen a parallel universe job is submitted. It stays idle. It does not accumulate any run time, ever. In looking at the Negotiator and Schedd log, the job is never matched. The only event which shows up in the job log is the submit event. In addition, other non dedicated scheduler universe jobs seem to match and run.

Here are some facts before we get to the actual error:

Fact 1

The submit machine version (which also matches the central manager):

[root@ldas-pcdev1 ~]# condor_version
$CondorVersion: 7.6.3 Aug 17 2011 BuildID: 361356 $
$CondorPlatform: x86_64_rhap_5 $

Fact 2

The requirements of the job failing to match:

[root@ldas-pcdev1 ~]# condor_q -format "Requirements = %s\n" Requirements
45649708.0
Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && (
TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( (
RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.FileSystemDomain ==
MY.FileSystemDomain )

Fact 3

The start expressions of all of the machines:

[root@ldas-pcdev1 ~]# condor_status -format "Start = %s\n" start | sort | uniq
-c
     48 Start = ( ( TARGET.ImageSize < Memory * 1024 ) && ( ( TARGET.Owner ==
"jabadie" ) || ( TARGET.Owner == "boinc" ) ) )
     12 Start = ( OWNER == "boinc" )
   2524 Start = ( TARGET.ImageSize < Memory * 1024 )

Fact 4

The startds have:

DedicatedScheduler = "DedicatedScheduler@ldas-pcdev1.ligo.caltech.edu"
STARTD_ATTRS = COLLECTOR_HOST_STRING, DedicatedScheduler

And it appears that the dedicated scheduler is configured correctly. The right attributes end up in the right ads, and so on.

Fact 5

I just noticed this. The submit machine really has a lot of different names.

[psilord@ldas-pcdev1 ~]$ grep ldas-pcdev1 /etc/hosts
10.14.0.18      ldas-pcdev1.ligo.caltech.edu ldas-pcdev1 ldas-pcdev1i
[psilord@ldas-pcdev1 ~]$ nslookup ldas-pcdev1
Server:       131.215.125.1
Address:      131.215.125.1#53

Name:   ldas-pcdev1.ligo.caltech.edu
Address: 131.215.115.249
[psilord@ldas-pcdev1 ~]$ ifconfig
eth2      Link encap:Ethernet  HWaddr 00:14:4F:EB:09:12
          inet addr:10.14.0.18  Bcast:10.14.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3336884344 errors:0 dropped:98216 overruns:0 frame:0
          TX packets:2987543450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3622679421353 (3.2 TiB)  TX bytes:2195267492023 (1.9 TiB)

eth2:1    Link encap:Ethernet  HWaddr 00:14:4F:EB:09:12
          inet addr:10.9.0.18  Bcast:10.9.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth3      Link encap:Ethernet  HWaddr 00:14:4F:EB:09:13
          inet addr:131.215.115.249  Bcast:131.215.115.255  Mask:255.255.254.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:261131735 errors:0 dropped:0 overruns:0 frame:0
          TX packets:113737475 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:351274020933 (327.1 GiB)  TX bytes:104762178670 (97.5 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:84018233 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84018233 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:16928138022 (15.7 GiB)  TX bytes:16928138022 (15.7 GiB)

The Problem

We see this in the negotiator log for the offending user in question:

10/04/11 15:48:15   Negotiating with DedicatedScheduler@ldas-pcdev1.ligo.caltech.edu at <10.14.0.18:44933>
10/04/11 15:48:15 0 seconds so far
10/04/11 15:48:15 Getting state information from the accountant
10/04/11 15:48:17 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd satya@ligo.
10/04/11 15:48:17 IO: Failed to read packet header
10/04/11 15:48:17     Failed to get reply from schedd
10/04/11 15:48:17   Error: Ignoring submitter for this cycle

And this in the schedlog:

10/04/11 15:48:17 (pid:9129) Negotiating for owner: satya@ligo
10/04/11 15:48:17 (pid:9129) Return from HandleReq <negotiate> (handler: 0.001s,sec: 0.000s)
10/04/11 15:48:17 (pid:9129) Return from Handler <DaemonCore::HandleReqSocketHandler> 0.0009s
10/04/11 15:48:17 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:17 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0004s
10/04/11 15:48:17 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:17 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0003s
10/04/11 15:48:18 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:18 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0004s
10/04/11 15:48:18 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:18 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0003s
10/04/11 15:48:18 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:18 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0001s
10/04/11 15:48:18 (pid:9129) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (4)
10/04/11 15:48:18 (pid:9129) Finished negotiating for satya in local pool: 0 matched, 3 rejected
10/04/11 15:48:18 (pid:9129) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0001s

We also see in the schedlog these types of errors:


10/04/11 15:40:08 (pid:9129) Failed to send RESCHEDULE to negotiator virgo-pilot-server.kfki.hu:

...

gethostbyname for local negotiator (condor.ldas.cit) failed!  Aborting negotiation.

Hrm, I just noticed that DedicatedScheduler@ldas-pcdev1.ligo.caltech.edu at <10.14.0.18:44933> doesn't match the nslookup of ldas-pcdev1.ligo.caltech.edu from that self-same machine. Maybe the submit machine has different names based upon who looks it up? Could it be Condor is getting confused on how to talk to that machine?

[Append remarks]

Remarks:

2011-Oct-11 16:00:14 by gthain:
So, the problem is that in schedd::negotiator, we reverse lookup the name of the negotiator:

        Daemon negotiator (DT_NEGOTIATOR);
        char *negotiator_hostname = negotiator.fullHostname();
        if (!negotiator_hostname) {
            dprintf(D_ALWAYS, "Negotiator hostname lookup failed!\n");
            free(sig_attrs_from_cm);
            return (!(KEEP_STREAM));
        }
        hent = condor_gethostbyname(negotiator_hostname);
        if (!hent) {
            dprintf(D_ALWAYS, "gethostbyname for local negotiator (%s) failed!"
                    "  Aborting negotiation.\n", negotiator_hostname);
            free(sig_attrs_from_cm);
            return (!(KEEP_STREAM));
        }

This code is before we check to see if we need to divert to the ded sched. So, we need to figure out why the negotiator thinks its name is "condor.ldas.cit"


2011-Oct-12 14:10:22 by psilord:
It turned out the /etc/hosts files and DNS systems were not configured correctly at Caltech. They fixed the configuration and the jobs started running.

However, I'm going to look through the code a little bit to see if any changes could be made which would either alleviate or direct the condor admins faster to the problem.


2011-Oct-12 15:37:07 by psilord:
The code path in which this lies is derived from here: git log e765cb0b.

It turns out that by inspection of a running 7.6.3 condor pool, that the negotiator and schedd believe they are speaking a 7.5.4 or earlier protocol. This is very wrong. I'm checking with the original author of that compatibility code path to find out why it is horked.


2011-Oct-13 11:55:50 by psilord:
This one is complicated, but I figured it out.

It turns out that we added a new attribute to the submitter ads called "SubmitterTag". When this attribute is present, the negotiator (when negotiating for that submitter) would use a new protocol to the schedd which was created at the 7.5.4 release of condor. If this was not present it would use the pre 7.5.4 negotiation protocol.

So, what happens is this:

The negotiator pulls the submitter ads from the collector, grinds on them, and then contacts the schedd. While there is only one connection from the negotiator to the schedd, each submitter ad gets its own "conversation" on the socket. Each conversation is prefixed by a number which identifies the protocol version of the conversation.

For non-dedicated scheduler submitter ads, the "SubmitterTag" attribute was present, and hence used the new 7.5.4+ protocol.

For dedicated scheduler submitter ads, the "SubmitterTag" was not present due to coding error, and so it used the pre 7.5.4 protocol.

What exacerbated the bug was that the schedlog only printed out the FIRST conversation protocol identifier that it was presented. So, we never saw it change in the schedd log during each negotiation conversation and so assumed that the shcedd was using the newer 7.5.4+ protocol for ALL conversations. This assumption was false.

So, Dan Bradley is fixing the dedicated scheduler to add the SubmitterTag attribute, and I'm fixing the schedlog output to be far more clear about what it is doing for each submitterad.


2011-Oct-13 16:03:18 by psilord:
Dan and I wrote the code, it is checked in and documented. Resolving.
[Append remarks]

Properties:

Type: defect           Last Change: 2011-Oct-13 16:03
Status: resolved          Created: 2011-Oct-11 14:23
Fixed Version: v070604           Broken Version: v070603 
Priority:          Subsystem: Parallel 
Assigned To: psilord           Derived From:  
Creator: psilord  Rust:  
Customer Group: ligo  Visibility: public 
Notify: psilord@cs.wisc.edu, tstclair@redhat.com, dan@hep.wisc.edu  Due Date: 20111013 

Related Check-ins:

2011-Oct-13 15:07   Check-in [27814]: ===GT=== #2540 ===VersionHistory:None=== Log which negotiation protocol the Scheduler::negotiate() call was passed. Since we negotiate per submitter ads, each submitter ad can use a different negotiation protocol, suppose NEGOTIATE, or NEGOTIATE_WITH_SIGATTRS. It turns out that if we don't know which [...] (By Peter Keller )
2011-Oct-13 12:52   Check-in [27802]: Documented fix for dedicated scheduler negotiation. ===GT=== #2540 ===VersionHistory:Complete=== (By Dan Bradley )
2011-Oct-13 12:43   Check-in [27801]: Make dedicated scheduler use the same DNS-free method for determining negotiator identity as the main schedd. #2540 (By Dan Bradley )