{section: Autoclustering: SIGNIFICANT_ATTRIBUTES and Automatic Significant Attributes}

The following is old documentation for the
_SIGNIFICANT_ATTRIBUTES_ ("SA")
setting in HTCondor.  It's here for three reasons.  1. Some customers are using
SA, and you might need to deal with their RUST.  2. The automatic SA
detection (appearing in 6.7.15) basically does this automatically, so this
is a good introduction to the general principles. 3. If the automatic SA
generation fails, you might need to fall back on explicit SA.

When attempting to match jobs, the negotiator basically asks the schedd,
"Let me see the next job", repeating until the schedd says, "No more jobs".
As an optimization, prior to SA, if the negotiator could not match a given
job, the schedd would not return any more processes in that given cluster
for the current negotiation cycle.  So if 12.4 failed to match, the schedd
would simply not try to match 12.5, 12.6 and so on.  The theory is that
other procs in the same cluster likely have similar attributes and
Requirements, so if one fails to match the others likely will fail to match
as well, so why bother the negotiator with them?

Problem 1: Sometimes different procs in a cluster had different
attributes.  For example, 12.4 might fail to match because it's run (and
was interrupted) once and has generated a monstrous ImageSize.  Problem 2:
Identical jobs in different clusters would never be optimized this way.
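
As a rough illustration of the old per-cluster shortcut and of Problem 1, here is a minimal Python sketch. It is not HTCondor code: job ads are modeled as plain dicts, and _negotiate_per_cluster_, _fits_, and the ImageSize values are made up for the example.

```python
def negotiate_per_cluster(jobs, can_match):
    """Offer jobs in order; once one proc fails, skip the rest of its cluster."""
    failed_clusters = set()
    matched = []
    attempts = 0
    for job in jobs:
        if job["ClusterId"] in failed_clusters:
            continue  # the pre-SA shortcut: don't bother the negotiator
        attempts += 1
        if can_match(job):
            matched.append(job)
        else:
            failed_clusters.add(job["ClusterId"])
    return matched, attempts

# Hypothetical queue: 12.4 has grown a huge ImageSize, its siblings have not.
jobs = [
    {"ClusterId": 12, "ProcId": 4, "ImageSize": 900000},
    {"ClusterId": 12, "ProcId": 5, "ImageSize": 2000},
    {"ClusterId": 12, "ProcId": 6, "ImageSize": 2000},
    {"ClusterId": 13, "ProcId": 0, "ImageSize": 2000},
]
fits = lambda job: job["ImageSize"] <= 100000  # stand-in for a machine's policy
matched, attempts = negotiate_per_cluster(jobs, fits)
print([f'{j["ClusterId"]}.{j["ProcId"]}' for j in matched], attempts)
```

In this toy run, 12.5 and 12.6 would have fit but are never offered, because 12.4 failed first.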

Solution: Autoclustering.  Automatically create clusters of jobs that
are "identical" for purposes of matchmaking. The initial implementation was
_SIGNIFICANT_ATTRIBUTES_, in which the administrator had to identify
which attributes of a job were important for deciding that two jobs were,
for matchmaking purposes, identical.

The remainder of this document is heavily based on the original rough
documentation given to users of _SIGNIFICANT_ATTRIBUTES_.  There is
some out of date information, but the core design remains accurate.


_SIGNIFICANT_ATTRIBUTES_ is a list of attributes from a schedd's
job ads that determines which jobs are treated as distinct during
negotiation. The schedd maintains a list of "autoclusters", and then
tries to put each job into one of those autoclusters. For each job, it
looks at the values of the attributes listed in _SIGNIFICANT_ATTRIBUTES_.
If all of those values match those of an existing autocluster, it
adds the job to that autocluster. If not, it creates a new
autocluster for that tuple of values.
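
The grouping rule just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the schedd's implementation: job ads are plain dicts, _build_autoclusters_ is a hypothetical name, and treating a missing attribute as None is an assumption of the sketch.

```python
from collections import defaultdict

def build_autoclusters(jobs, significant_attributes):
    """Group job ads by their tuple of significant attribute values."""
    autoclusters = defaultdict(list)
    for job in jobs:
        # Assumption of this sketch: an absent attribute counts as None.
        key = tuple(job.get(attr) for attr in significant_attributes)
        autoclusters[key].append(job)
    return dict(autoclusters)

# Made-up job ads for the example.
jobs = [
    {"ClusterId": 12, "ProcId": 4, "ImageSize": 900000, "DiskUsage": 100},
    {"ClusterId": 12, "ProcId": 5, "ImageSize": 2000, "DiskUsage": 100},
    {"ClusterId": 13, "ProcId": 0, "ImageSize": 2000, "DiskUsage": 100},
]
clusters = build_autoclusters(jobs, ["ImageSize", "DiskUsage"])
for key, members in clusters.items():
    print(key, [f'{j["ClusterId"]}.{j["ProcId"]}' for j in members])
```

Here 12.5 and 13.0 land in the same autocluster even though they come from different clusters, while 12.4 gets its own.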

To get the pre-6.7.15 behavior (one autocluster per job cluster), just set

{code}SIGNIFICANT_ATTRIBUTES = ClusterId{endcode}

To make significant attributes useful, use the following formula:

1. Collect all _START_ expressions from the pool. Find all job
attributes that the _START_ expressions reference. For example:


{code}
Start = (TARGET.ImageSize <= ((My.Memory - 15) * 1024))

SIG_ATTRS_FROM_STARTD = ImageSize
{endcode}


2. Now add anything from the job ad referenced by the Central
Manager's _PREEMPTION_REQUIREMENTS_.

(In _PREEMPTION_REQUIREMENTS_, _MY_ is the startd ad of the machine
we're considering for preemption, and _TARGET_ is the ad of the job
we're considering putting on the machine. Anything you want from
the job that's currently running on the machine has to be
exported into the machine ad with _STARTD_JOB_EXPRS_.)


{code}
Preemption_Requirements = (MY.ResearchGroupOfMachine == Target.ResGroupOfJob)

SIG_ATTRS_FROM_PREEMPTION = ResGroupOfJob
{endcode}


3. At the schedd, we're always going to want to have jobs with
different requirements be part of different AutoClusters.  (The
schedd may do this automatically in the future.)

{code}
SIG_ATTRS_OBVIOUSLY_NEEDED = Requirements
{endcode}


4. Now find all attributes of the job ad that the job's own
requirements reference, so that two jobs that have the _same
expression_ for requirements but _different values_ for the attributes
in that expression get placed in different autoclusters:


{code}
Requirements = (Target.HasSwapCkpt == TRUE) && (TARGET.Arch == "INTEL") \
    && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= MY.DiskUsage)

SIG_ATTRS_FROM_REQUIREMENTS = DiskUsage
{endcode}


We're now ready to create the list of significant attributes for the
schedd:

{code}
SIGNIFICANT_ATTRIBUTES = $(SIG_ATTRS_FROM_STARTD),\
                         $(SIG_ATTRS_FROM_PREEMPTION), \
                         $(SIG_ATTRS_OBVIOUSLY_NEEDED), \
                         $(SIG_ATTRS_FROM_REQUIREMENTS)
{endcode}
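
If you have many expressions to go through, steps 1 through 4 can be roughed out with a script. The Python sketch below is a crude heuristic, not a ClassAd parser: it only finds references that carry an explicit _TARGET._ or _MY._ scope, so unscoped references still need to be found by hand. The function name _scoped_refs_ is made up for the example, and the expressions are the ones used above.

```python
import re

def scoped_refs(expression, scope):
    """Return the sorted attribute names written as <scope>.<name>."""
    pattern = re.compile(r"\b" + re.escape(scope) + r"\.(\w+)", re.IGNORECASE)
    return sorted(set(pattern.findall(expression)))

start = "(TARGET.ImageSize <= ((MY.Memory - 15) * 1024))"
preemption = "(MY.ResearchGroupOfMachine == TARGET.ResGroupOfJob)"
requirements = (
    '(TARGET.HasSwapCkpt == TRUE) && (TARGET.Arch == "INTEL") '
    '&& (TARGET.OpSys == "LINUX") && (TARGET.Disk >= MY.DiskUsage)'
)

significant = (
    scoped_refs(start, "TARGET")         # step 1: job attrs used by START
    + scoped_refs(preemption, "TARGET")  # step 2: job attrs used by preemption policy
    + ["Requirements"]                   # step 3: always needed
    + scoped_refs(requirements, "MY")    # step 4: job attrs used by the job's Requirements
)
print("SIGNIFICANT_ATTRIBUTES = " + ", ".join(significant))
```

Note the scope flips between steps: in machine-side expressions the job is _TARGET_, while in the job's own Requirements the job is _MY_.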


Note that if you make your significant attributes too restrictive
(you separate jobs in ways that don't actually matter), you'll end
up with extra negotiation.  This is harmless, but reduces the
benefits of _SIGNIFICANT_ATTRIBUTES_.   "_SIGNIFICANT_ATTRIBUTES =
ClusterId_" is the extreme version of this.

If your significant attributes are too loose (you cluster
dissimilar jobs), HTCondor may skip over jobs that can in fact be
run.  If you follow the system above, you shouldn't run into
this.

Given the above, note that putting certain things in your
_SIGNIFICANT_ATTRIBUTES_ will lead to worst case behavior: every
single job gets its own autocluster and zero optimization is performed.
This includes any unique or highly individual attribute.  Things
that probably never belong in SA include _QDate, ClusterId, and
ProcId_.  Note that the automatic significant attribute finder in
6.7.15 will happily mark these as significant if they are present in the
expressions above.  Thus, avoid putting unique or highly individual things in
your machine expressions and the like.  For example, giving startds a
"_RANK=2000000000-QDate_" to try and prioritize older jobs will
cause worst case behavior.  It will work correctly, but slowly.
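
To see why a unique attribute is so costly, here is a small Python sketch that counts autoclusters for a made-up queue with and without _QDate_ in the significant list. The job ads are plain dicts with invented values, and _count_autoclusters_ is a hypothetical helper, not anything HTCondor provides.

```python
def count_autoclusters(jobs, significant_attributes):
    """Count the distinct tuples of significant attribute values."""
    return len({tuple(job.get(a) for a in significant_attributes) for job in jobs})

# 1000 otherwise identical jobs, each submitted one second apart (invented data).
jobs = [
    {"ClusterId": 100 + i, "ProcId": 0, "QDate": 1100000000 + i, "ImageSize": 2000}
    for i in range(1000)
]

without_qdate = count_autoclusters(jobs, ["ImageSize"])
with_qdate = count_autoclusters(jobs, ["ImageSize", "QDate"])
print(without_qdate, with_qdate)
```

Without _QDate_ the whole queue collapses into one autocluster; with it, every job is its own autocluster and the negotiator has to consider each one separately.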