Ticket #2441: Condor startd on Windows can't run more than 8 jobs simultaneously

On a 40-core machine running Win2k8 SP2, a single startd can only run about 8 jobs simultaneously. After about 8 jobs are running, the schedd reports that claim activation is failing.

This ticket was created to separate the relatively trivial bug of Condor not detecting all 40 CPUs from the more difficult scalability issues. See #2381 for the trivial bug.

Note that some of the remarks below were moved from #2381 to here.

Ticket #2453 was created to track the work of figuring out what the problem is.


Remarks:

2011-Aug-12 16:16:27 by tstclair:
Just triaging so it's on the radar; feel free to amend.


2011-Aug-17 12:50:50 by tannenba:
Additional info / thoughts:

Cycle reports this problem did not happen on Win XP, 2k, or 2k3. They think the problem may be correlated with the new networking stack first rolled out in Vista/HPC Server/2k8, and they have definitely seen it with HPC Server 2k3/2k8 and Win 2k8. On these platforms, Cycle reports that the problem can be observed with as few as 12 slots :(. The more slots, the more consistently it happens.

One possible (albeit cumbersome) immediate work-around could be to run multiple startds on the same machine, e.g. 4 startds w/ 8 slots each instead of one startd with 32 slots. Perhaps a HOWTO article on how to configure multiple startds under one master would be helpful? The HOWTO could explain how to configure them (create log & exec directories, set STARTD_NAME, set the master to start them), and also some of the pros/cons (e.g. some STARTD machine ad attributes, like NonCondorLoadAvg, are system-wide and will no longer make sense if you are running multiple startds).


2011-Aug-17 12:52:41 by tannenba:
This issue seems specific to Windows, as we have often run more than 12 slots on Linux w/o seeing this problem.


2011-Aug-17 12:56:33 by tannenba:
Random shot-in-the-dark thought: from a quick/dirty look at the logs, it appears the problem manifests when the schedd is activating claims to the startd and the startd times out. Activation causes the startd to spawn a starter. One difference between Condor on Linux and Windows is that signaling between the startd and starter happens via POSIX signals on Linux, whereas Windows uses DaemonCore-created CEDAR sockets. Could that cause us to hit a deadlock between the schedd, startd, and starter?
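To make the hypothesised failure mode concrete, here is a minimal sketch in plain Python sockets (not Condor's CEDAR code, and the 0.5 s timeout is arbitrary): if both endpoints of a socket-based signaling channel wait to read before either writes, neither makes progress, and timeouts are the only way the stall becomes visible.

```python
# Minimal illustration (hypothetical, not Condor code) of a mutual-read stall:
# two connected endpoints each wait for the other to speak first.
import socket

a, b = socket.socketpair()
a.settimeout(0.5)
b.settimeout(0.5)

def try_recv(sock):
    """Return True if data arrived, False if the read timed out."""
    try:
        sock.recv(1)
        return True
    except socket.timeout:
        return False

# Both sides read first; neither has sent anything, so both time out
# instead of progressing -- the shape of the suspected deadlock.
stalled = (not try_recv(a)) and (not try_recv(b))
print(stalled)  # True: each side was waiting on the other
```

With POSIX signals the "send" side never blocks, so this particular interlock cannot arise the same way on Linux.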


2011-Aug-22 08:32:31 by tstclair:
What happens if they change _TIMEOUT_INTERVAL on the schedd?

E.g. raise all of the timeouts just to see what happens.
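A hedged sketch of what that could look like in the config (the multiplier knob names here are an assumption; verify the exact names against the manual before trying this):

```
# Assumed knobs: per-daemon timeout multipliers that scale the daemons'
# communication timeouts uniformly.
SHADOW_TIMEOUT_MULTIPLIER = 20
STARTD_TIMEOUT_MULTIPLIER = 20
```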

2011-Aug-22 13:19:42 by ichesal:
I raised the shadow and startd timeouts to 20 -- no difference, and failures on the comm channel appeared just as quickly.


2011-Oct-03 10:23:33 by ichesal:
I have an update for this ticket: the result of testing with multiple startds on the 40-core machine.

TL;DR: Multiple startds don't help the situation. Would you like to arrange access to the machine?

Throughout the tests a 5-startd setup on a 40-core machine was used. The configuration for the tests was:

# Fake more CPUs than Condor can count
NUM_CPUS = 40

# Tell the master to run the procd
MASTER.USE_PROCD = True

# Always run
START = True

# Define the first STARTD on this machine
STARTD1 = $(STARTD)
STARTD1_ARGS = -f -local-name S1
STARTD1_LOG = $(LOCAL_DIR)/log/StartdLog.S1
STARTD1_EXECUTE = $(LOCAL_DIR)/execute.S1
STARTD1_ENVIRONMENT = "_condor_STARTER_LOG=$(LOG)/StarterLog.S1"
STARTD.S1.NUM_SLOTS = 8
STARTD.S1.STARTD_NAME = S1
STARTD.S1.STARTD_LOG = $(STARTD1_LOG)
STARTD.S1.STARTD_EXECUTE = $(STARTD1_EXECUTE)
DAEMON_LIST = $(DAEMON_LIST), STARTD1

# Define the second STARTD on this machine
STARTD2 = $(STARTD)
STARTD2_ARGS = -f -local-name S2
STARTD2_LOG = $(LOCAL_DIR)/log/StartdLog.S2
STARTD2_EXECUTE = $(LOCAL_DIR)/execute.S2
STARTD2_ENVIRONMENT = "_condor_STARTER_LOG=$(LOG)/StarterLog.S2"
STARTD.S2.NUM_SLOTS = 8
STARTD.S2.STARTD_NAME = S2
STARTD.S2.STARTD_LOG = $(STARTD2_LOG)
STARTD.S2.STARTD_EXECUTE = $(STARTD2_EXECUTE)
DAEMON_LIST = $(DAEMON_LIST), STARTD2

# Define the third STARTD on this machine
STARTD3 = $(STARTD)
STARTD3_ARGS = -f -local-name S3
STARTD3_LOG = $(LOCAL_DIR)/log/StartdLog.S3
STARTD3_EXECUTE = $(LOCAL_DIR)/execute.S3
STARTD3_ENVIRONMENT = "_condor_STARTER_LOG=$(LOG)/StarterLog.S3"
STARTD.S3.NUM_SLOTS = 8
STARTD.S3.STARTD_NAME = S3
STARTD.S3.STARTD_LOG = $(STARTD3_LOG)
STARTD.S3.STARTD_EXECUTE = $(STARTD3_EXECUTE)
DAEMON_LIST = $(DAEMON_LIST), STARTD3

# Define the fourth STARTD on this machine
STARTD4 = $(STARTD)
STARTD4_ARGS = -f -local-name S4
STARTD4_LOG = $(LOCAL_DIR)/log/StartdLog.S4
STARTD4_EXECUTE = $(LOCAL_DIR)/execute.S4
STARTD4_ENVIRONMENT = "_condor_STARTER_LOG=$(LOG)/StarterLog.S4"
STARTD.S4.NUM_SLOTS = 8
STARTD.S4.STARTD_NAME = S4
STARTD.S4.STARTD_LOG = $(STARTD4_LOG)
STARTD.S4.STARTD_EXECUTE = $(STARTD4_EXECUTE)
DAEMON_LIST = $(DAEMON_LIST), STARTD4

# Define the fifth STARTD on this machine
STARTD5 = $(STARTD)
STARTD5_ARGS = -f -local-name S5
STARTD5_LOG = $(LOCAL_DIR)/log/StartdLog.S5
STARTD5_EXECUTE = $(LOCAL_DIR)/execute.S5
STARTD5_ENVIRONMENT = "_condor_STARTER_LOG=$(LOG)/StarterLog.S5"
STARTD.S5.NUM_SLOTS = 8
STARTD.S5.STARTD_NAME = S5
STARTD.S5.STARTD_LOG = $(STARTD5_LOG)
STARTD.S5.STARTD_EXECUTE = $(STARTD5_EXECUTE)
DAEMON_LIST = $(DAEMON_LIST), STARTD5

DC_DAEMON_LIST = +STARTD1 STARTD2 STARTD3 STARTD4 STARTD5

Jobs were sleep jobs set to run for one hour, ensuring that: 1) the CPUs were actually supposed to be quiet; 2) there was plenty of time for Condor to fill the machine before jobs started to complete.
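For reference, a minimal sketch of the kind of submit file described (the executable name and script are guesses -- any hour-long sleep script would do):

```
# Hypothetical submit file: 40 one-hour sleep jobs on the Windows node.
universe   = vanilla
executable = sleep.bat
arguments  = 3600
queue 40
```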

Test Machine 1: Win2k8

Ran these tests with and without the procd enabled on the machine.

With the procd we could only get 5 jobs started (slot1 on each of the 5 startds ran a job; subsequent slots failed to start their jobs with a comm error recorded between the startd and starter). The procd's CPU use was intense, pegging 100% of one CPU the entire time. The 5 condor_starter processes were similarly CPU-intensive the entire time.

Without the procd we could get a few more jobs running, but the CPU use of the condor_master, condor_startds and condor_starters was intense, each using 100% of a CPU.

Test Machine 2: Win2k8 R2

Ran these tests with and without the procd enabled on the machine.

With the procd we got 27 sleep jobs running (though condor_status showed 37 claimed CPUs on the machine). With the procd in use, the condor_procd was using 100% CPU, as were the condor_startds, but the condor_starters were at the expected 0% CPU level.

Without the procd we could only get ~5 jobs started. The CPU use of the condor_master, condor_startds and condor_starters was 100% apiece. Jobs were running in the slot1 slots for each startd.


2011-Oct-24 16:30:55 by ichesal:
The 7.6.5 preview from TJ is so close.

The good news:

Job startup and job termination both happen much faster than they ever have -- near-instant when it works well.

For 40 sleep jobs Condor was able to take the machine from empty to all 40 slots running sleep jobs in a few seconds.

The bad news:

For 40 jobs that do actual CPU-bound processing, Condor could only get 33 of the slots running jobs. It did fill those quickly -- a few seconds at most before everything was running. The remaining 7 slots refused to run jobs; run attempts failed because of interprocess I/O errors, which can be seen in the StartLog. I'm attaching all the log files, as well as a screen cap of the machine in a steady state (33/40 slots occupied) so you can see overall CPU use from taskman. The configuration used on the machine is also contained in the zip file.

This is definitely a better-performing Condor release. Even while it was showing interprocess I/O errors, I could still run condor_status -direct against the machine and the startd would respond. In the 7.6.4 release this isn't possible.


2011-Oct-24 16:32:40 by ichesal:
Zip file with the logs and configs is too big to be attached. Here's a download link: http://dl.dropbox.com/u/870088/condor/condor-7.6.5-testing-logs.zip


2011-Nov-07 09:49:22 by ichesal:
I'm re-testing the load jobs with John's patched binary today, with greatly reduced log verbosity and no logging for the procd. I'll let you know how it goes.

- Ian


2011-Nov-11 12:40:05 by ichesal:
Hi guys,

I was able to re-run the 7.6.5-PRE tests. This time: logging left at the default level on the STARTD and turned off completely on the PROCD. The jobs were CPU-bound jobs that were expected to run for about an hour each. We left the machine up for about 10 minutes and steady state was 30/40 slots occupied by jobs. The other 10 slots would try to start jobs, fail, repeat.

Log files and the config files used are available here: http://dl.dropbox.com/u/870088/condor/condor-7.6.5-retest-logs.zip

Let me know if you need anything else.

Thanks!

- Ian


Properties:

Type: defect           Last Change: 2012-Dec-17 12:08
Status: resolved          Created: 2011-Sep-07 15:35
Fixed Version: v070604           Broken Version: v070600 
Priority:          Subsystem: Win32 
Assigned To: tannenba           Derived From: #2381
Creator: johnkn  Rust:  
Customer Group: other  Visibility: public 
Notify: tstclair@redhat.com, tannenba@cs.wisc.edu, ichesal@cyclecomputing.com  Due Date: 20111110 

Derived Tickets:

#2424   windows startd does not scale beyond 12 slots
#2453   Figure out why we can only run about 8 jobs on a windows exec node

Related Check-ins:

2011-Nov-07 08:56   Check-in [28143]: statistics to measure performance of process monitoring. #2441 ===VersionHistory:None=== not intended to be user visible stats (By John (TJ) Knoeller )
2011-Nov-07 08:56   Check-in [28142]: performance measurement code for PROCAPI. builds on *nix but mostly only measures performance of the windows version of PROCAPI. in support of #2441 ===VersionHistory:None=== no user visible changes (By John (TJ) Knoeller )
2011-Nov-04 16:12   Check-in [28134]: statistics to measure performance of process monitoring. #2441 ===VersionHistory:None=== not intended to be user visible stats (By John (TJ) Knoeller )
2011-Nov-04 16:06   Check-in [28133]: performance measurement code for PROCAPI. builds on *nix but mostly only measures performance of the windows version of PROCAPI. in support of #2441 ===VersionHistory:None=== no user visible changes (By John (TJ) Knoeller )
2011-Oct-20 09:27   Check-in [27913]: Change the windows implementation of CSysInfo::GetParentPID to use NtQueryInformationProcess rather than CreateToolhelp32Snapshot. This results in a 2000x performance improvement in ProcAPI::snapshot which in turn provides a partial fix for #2441 - claim activation fail when more than 8 jobs running [...] (By John (TJ) Knoeller )