{section: Autoclustering: SIGNIFICANT_ATTRIBUTES and Automatic Significant Attributes}

The following is old documentation for the _SIGNIFICANT_ATTRIBUTES_ ("SA") setting in HTCondor. It's here for three reasons.

1. Some customers are using SA, and you might need to deal with their RUST.
2. The automatic SA detection (appearing in 6.7.15) basically does what is described here for you, so this is a good introduction to the general principles.
3. If the automatic SA generation fails, you might need to fall back on explicit SA.

When attempting to match jobs, the negotiator basically asks the schedd, "Let me see the next job", repeating until the schedd says, "No more jobs". As an optimization, prior to SA, if the negotiator could not match a given job, the schedd would not return any more procs in that cluster for the current negotiation cycle. So if 12.4 failed to match, the schedd would simply not try to match 12.5, 12.6 and so on. The theory is that other procs in the same cluster likely have similar attributes and Requirements, so if one fails to match the others will likely fail to match as well, so why bother the negotiator with them?

Problem 1: Sometimes different procs in a cluster had different attributes. For example, 12.4 might fail to match because it has already run once (and was interrupted) and has generated a monstrous ImageSize.

Problem 2: Identical jobs in different clusters would never be optimized this way.

Solution: Autoclustering. Automatically create clusters of jobs that are "identical" for purposes of matchmaking. The initial implementation was _SIGNIFICANT_ATTRIBUTES_, in which the administrator had to identify which attributes of a job were important for deciding that two jobs were, for matchmaking purposes, identical.

The remainder of this document is heavily based on the original rough documentation given to users of _SIGNIFICANT_ATTRIBUTES_. There is some out-of-date information, but the core design remains accurate.
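To make the pre-SA optimization and Problem 1 concrete, here is a minimal Python sketch. It is not HTCondor code: the function =negotiate_pre_sa=, the =can_match= predicate, and the sample job IDs are hypothetical names chosen for illustration, and the real schedd/negotiator protocol is considerably more involved.

```python
def negotiate_pre_sa(jobs, can_match):
    """Sketch of the pre-SA rule: once one proc in a cluster fails to
    match, the remaining procs of that cluster are not offered again
    during this negotiation cycle.

    jobs      -- list of (cluster, proc) pairs in queue order
    can_match -- hypothetical predicate standing in for the negotiator
    """
    failed_clusters = set()
    matched, skipped = [], []
    for cluster, proc in jobs:
        if cluster in failed_clusters:
            # An earlier proc in this cluster failed; never ask about this one.
            skipped.append((cluster, proc))
            continue
        if can_match(cluster, proc):
            matched.append((cluster, proc))
        else:
            failed_clusters.add(cluster)
    return matched, skipped


# Assumed scenario: only 12.4 is unmatchable (say, a huge ImageSize).
jobs = [(12, 3), (12, 4), (12, 5), (12, 6), (13, 0)]
matched, skipped = negotiate_pre_sa(jobs, lambda c, p: (c, p) != (12, 4))
print(matched)  # procs that were offered and matched
print(skipped)  # procs never offered, although they could have matched
```

In this scenario 12.5 and 12.6 are skipped even though they would have matched, which is exactly Problem 1.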
_SIGNIFICANT_ATTRIBUTES_ is a list of attributes from a schedd's job ads that determines which jobs are treated as distinct during negotiation. The schedd maintains a list of "autoclusters", and tries to put each job into one of them. For each job, it looks at the values of the attributes listed in _SIGNIFICANT_ATTRIBUTES_. If all of those values match those of an existing autocluster, it adds the job to that autocluster. If not, it creates a new autocluster for that tuple of values.

To get the pre-6.7.15 behavior, just set

{code}SIGNIFICANT_ATTRIBUTES = ClusterId{endcode}

To make significant attributes useful, use the following procedure:

1. Collect all _START_ expressions from the pool. Find all job attributes that the _START_ expressions reference. For example:

{code}
Start = (TARGET.ImageSize <= ((MY.Memory - 15) * 1024))
SIG_ATTRS_FROM_STARTD = ImageSize
{endcode}

2. Now add anything from the job ad referenced by the Central Manager's _PREEMPTION_REQUIREMENTS_. (_MY_ in _PREEMPTION_REQUIREMENTS_ is the startd ad of the machine we're considering for preemption; _TARGET_ is the job ad that we're considering putting on the machine. Anything you want from the job that's currently running on the machine has to be exported into the machine ad with _STARTD_JOB_EXPRS_.)

{code}
Preemption_Requirements = (MY.ResearchGroupOfMachine == TARGET.ResGroupOfJob)
SIG_ATTRS_FROM_PREEMPTION = ResGroupOfJob
{endcode}

3. At the schedd, we're always going to want jobs with different requirements to be part of different autoclusters. (The schedd may do this automatically in the future.)

{code}
SIG_ATTRS_OBVIOUSLY_NEEDED = Requirements
{endcode}

4. Now, find all references in the job ads' _Requirements_ to other attributes of the job ad, so that two jobs that have the _same expression_ for _Requirements_ but _different values_ for the attributes it references get placed in different autoclusters:

{code}
Requirements = (TARGET.HasSwapCkpt == TRUE) && (TARGET.Arch == "INTEL") \
    && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= MY.DiskUsage)
SIG_ATTRS_FROM_REQUIREMENTS = DiskUsage
{endcode}

We're now ready to create the list of significant attributes for the schedd:

{code}
SIGNIFICANT_ATTRIBUTES = $(SIG_ATTRS_FROM_STARTD), \
    $(SIG_ATTRS_FROM_PREEMPTION), \
    $(SIG_ATTRS_OBVIOUSLY_NEEDED), \
    $(SIG_ATTRS_FROM_REQUIREMENTS)
{endcode}

Note, if you make your significant attributes too restrictive (you cluster jobs in ways that don't actually matter), you'll end up with extra negotiation. This is harmless, but reduces the benefits of _SIGNIFICANT_ATTRIBUTES_. "_SIGNIFICANT_ATTRIBUTES = ClusterId_" is the extreme version of this.

If your significant attributes are too loose (you cluster dissimilar jobs), HTCondor may skip over jobs that can in fact be run. If you follow the system above, you shouldn't run into this.

Given the above, note that putting certain things in your _SIGNIFICANT_ATTRIBUTES_ will lead to worst-case behavior: every single job gets its own autocluster and zero optimization is performed. This includes any unique or nearly unique attribute. Things that probably never belong in SA include _QDate, ClusterId, and ProcId_.

Note that the automatic significant attribute finder in 6.7.15 will happily mark these as significant if they are present in the expressions above. Thus, avoid referencing unique or nearly unique job attributes in your machine expressions and the like. For example, giving startds a "_RANK=2000000000-QDate_" to try to prioritize older jobs will cause worst-case behavior. It will work correctly, but slowly.
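The grouping rule described in this document (one autocluster per distinct tuple of significant attribute values) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the schedd's implementation: =build_autoclusters=, the dictionary representation of job ads, and the sample values are all hypothetical, and _Requirements_ is compared as a plain string.

```python
def build_autoclusters(jobs, significant_attributes):
    """Group job ads by the tuple of their significant attribute values.

    jobs                   -- list of dicts standing in for job ads
    significant_attributes -- attribute names, as in SIGNIFICANT_ATTRIBUTES
    An attribute missing from a job ad is treated as None (an assumption).
    """
    autoclusters = {}
    for job in jobs:
        key = tuple(job.get(attr) for attr in significant_attributes)
        # Same tuple of values: join the existing autocluster; else start a new one.
        autoclusters.setdefault(key, []).append(job)
    return autoclusters


REQ = 'TARGET.Disk >= MY.DiskUsage'  # identical expression text in every job
jobs = [
    {"ClusterId": 12, "ProcId": 0, "ImageSize": 1000, "DiskUsage": 50, "Requirements": REQ},
    {"ClusterId": 12, "ProcId": 1, "ImageSize": 900000, "DiskUsage": 50, "Requirements": REQ},
    {"ClusterId": 13, "ProcId": 0, "ImageSize": 1000, "DiskUsage": 50, "Requirements": REQ},
]

# List assembled following the four steps above.
by_formula = build_autoclusters(
    jobs, ["ImageSize", "ResGroupOfJob", "Requirements", "DiskUsage"])
# Pre-6.7.15 behavior: one autocluster per ClusterId.
by_cluster = build_autoclusters(jobs, ["ClusterId"])
# Worst case: a unique key puts every job in its own autocluster.
worst_case = build_autoclusters(jobs, ["ClusterId", "ProcId"])

print(len(by_formula), len(by_cluster), len(worst_case))
```

With the list built from the four steps, 12.0 and 13.0 share an autocluster across cluster boundaries while 12.1 (with its large ImageSize) is separated, which addresses both Problem 1 and Problem 2. Adding a unique attribute such as ProcId degrades the result to one autocluster per job.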