Ticket #2681: parallel universe broken on dynamic slots (request_cpus is ignored)

Ticket resolved: this issue was addressed in #2808.

[This is copied from two emails by Steffen Grunewald <Steffen.Grunewald@aei.mpg.de> to the CondorLIGO list.]

Getting closer with Parallel Universe on Dynamic slots... but still no cigar.

The setup consists of five 4-core machines and several 2-core machines. All of them have been configured as single partitionable slots. Preemption is completely disabled. The rank definitions are as follows:

RANK = 0
NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory

I'd expect this to favour big machines over small ones (for Parallel jobs), and partially occupied ones over empty ones.
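
For illustration only - assuming the boolean factor evaluates to 1 for a parallel-universe job, and assuming hypothetical memory figures of 8000 MB on a 4-core machine and 4000 MB on a 2-core machine - the expression would rank the partitionable slots roughly like this:

empty 2-core machine (TotalSlots=1):
    1000000000 + 1000000000*(2+1) - 1000*4000 = 3996000000
empty 4-core machine (TotalSlots=1):
    1000000000 + 1000000000*(4+1) - 1000*8000 = 5992000000
4-core machine, one dynamic slot already carved out (TotalSlots=2, Memory down to 7500):
    1000000000 + 1000000000*(4+2) - 1000*7500 = 6992500000

i.e. bigger machines, and partially occupied machines, should come out on top.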

What I see with the following submit file is quite different:

universe   = parallel
initialdir = /home/steffeng/tests/mpi/
executable = /home/steffeng/tests/mpi/mpitest
arguments  =  $(Process) $(NODE)
output     = out.$(NODE)
error      = err.$(NODE)
log        = log
notification = Never
on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
should_transfer_files = yes
when_to_transfer_output = on_exit
Requirements = ( TotalCpus == 4 )
request_memory = 500
machine_count = 10
(mpitest is the ubiquitous "MPI hello world" program, which queries its rank and the size of MPI_COMM_WORLD)

Condor version is 7.6.0 (which should include the fixes from ticket #986 that went into 7.5.6).

How can I debug this?

===

It turns out that request_cpus=n, independent of n, results in one slot being claimed per machine. I could verify this with "request_cpus=4" and "machine_count=4", which claimed a single slot on each of four machines, just as "request_cpus=1" or "request_cpus=2" would have done.

"machine_count" obviously gets translated into the number of individual MPI jobs (nodes), and "request_cpus" would define the number of CPU cores assigned to each of them. It's my problem if the nodes don't know about multi-core on their own.

Apparently, dynamic slot provisioning doesn't work well with parallel universe yet.

As soon as I return to old-style slot splitting (NUM_SLOTS=4, cpu=1, memory=25%, etc.), I get the "proximity" I'm looking for - of a machine_count=10 job, the first 4 MPI nodes get sent to one machine, 4 to the next one, and 2 to another.
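
(Roughly, the old-style static configuration referred to above - a sketch, not necessarily the exact lines in use:)

SLOT_TYPE_1               = cpus=1, memory=25%
NUM_SLOTS_TYPE_1          = 4
SLOT_TYPE_1_PARTITIONABLE = False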

So I can either do hard partitioning and get proper MPI behaviour, or do dynamic partitioning and be able to run memory-hungry jobs. Unfortunately, the users have been asking for both (and the mix is unpredictable).

To add to the inconvenience, for each such reconfiguration Condor has to be stopped completely on the affected machines.

Are there plans to make Condor more flexible? Using up as many dynamic slots as possible on the same machine would help a lot. In the manual, and everywhere else I looked, "dynamic slots" and "parallel universe" seem to be disjoint concepts...

BTW: If there were proper co-existence of dynamic slots and the parallel universe, one would have to look for a NEGOTIATOR_PRE_JOB_RANK expression that yields the best results for the parallel job while harming as few other jobs as possible - perhaps such a thing doesn't even exist if preemption is allowed? Without preemption things should be easier. Rank by the number of unclaimed CPUs? How would one do that - introduce another machine ClassAd attribute such as UnclaimedCpus? I vaguely remember someone coming up with a huge ifThenElse construction to sum up the resources "bound" by claimed dynamic slots, but there should be a solution that still works for 64 cores...
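
(One untested idea along those lines, assuming the partitionable slot's Cpus attribute reports the cores not yet carved out into dynamic slots, so that TotalCpus - Cpus approximates the cores already bound on that machine:)

NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus + (TotalCpus - Cpus)) - 1000 * Memory

This would only be meaningful when the match is made against the partitionable slot ad itself, not against an already carved-out dynamic slot.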


Remarks:

2011-Dec-02 16:36:34 by tstclair:
Could have sworn this has been fixed.


2011-Dec-09 08:56:04 by matt:
See also https://bugzilla.redhat.com/show_bug.cgi?id=545423


2011-Dec-15 17:10:53 by eje:
Fixing this requires making some mods to the negotiator.
  1. Make the computation of available resources ("pie") more aware of partitionable slots (a sort of extension to the fix for #2440).
  2. To achieve the desired "packing" of machines, the negotiator loop needs logic to allow matching more than one job to a partitionable slot ad.

Currently the negotiation loop cycles through the slot ads, under the (incorrect) assumption that one job can be matched to one ad. Fixing this would also improve non-parallel matching performance, since a partitionable slot could match against more than one job per negotiation cycle.


2011-Dec-16 10:57:00 by eje:
Making the necessary changes to the negotiation logic seems a bit risky for the stable series, especially on short notice and without better regression testing.

Changing this partitionable-slot-based logic would also be a good opportunity to consider issues with the code paths around weighted/unweighted slots, as well as the semantics of weighted slots combined with partitionable slots.


2012-Feb-10 14:47:41 by pfc:
Updated priority to match Todd's new scheme (1=fire, 2=soon, 3=time permitting, 4=not yet prioritized, 5=wishlist/ideas).


2012-Apr-14 08:42:45 by tannenba:
The development work in this ticket was addressed in #2808.

Properties:

Type: defect           Last Change: 2012-Apr-14 08:45
Status: resolved          Created: 2011-Dec-01 21:44
Fixed Version: v070706           Broken Version: v070600 
Priority:          Subsystem: Parallel 
Assigned To: gthain           Derived From: #986
Creator: pfc  Rust:  
Customer Group: ligo  Visibility: public 
Notify: pfcouvar@syr.edu, Steffen.Grunewald@aei.mpg.de, tstclair@redhat.com, eje@cs.wisc.edu, tannenba@cs.wisc.edu, gthain@cs.wisc.edu  Due Date: