Migration of CHTC from EL6 to EL7

Every few years, CHTC migrates its execute machines from one OS to another, and each time we go to considerable effort to lessen the burden on users. This page documents the EL6 to EL7 migration so that the issues are recorded for subsequent transitions in CHTC and for other sites.

Requirements and Assumptions

The transition must be gradual and staged, so that users with old jobs that cannot run on the new OS can run them to completion on a subset of the pool while users with jobs that can only run on the new OS make progress.

If a user specifies nothing extraordinary, their jobs run on the "default" OS, which is initially the old one and at some point is switched to the new one.

During the transition, some jobs will run only on the old OS, some only on the new, and some (many?) on both, and the user needs to be able to opt into any of these three cases.

This transition applies to

The HTCondor religion is that execute nodes advertise what they are, and it is the responsibility of jobs to select what they need. If a job selects the wrong OS to run on, that's on the job; the startd should not care. Pragmatically, however, we may need to implement policy on the execute node in cases where it is the only place CHTC has control.

The problem with defaulting

Ideally, we'd just ask the users to put a clause in their requirements expression, selecting a particular platform, e.g.

requirements = OpSysMajorVer == 7

or

requirements = ((OpSysMajorVer == 7) || (OpSysMajorVer == 6))

However, there are a couple of problems with that. The first is defaulting. When a job has requirements that don't mention OpSysMajorVer, the submit side needs to add a clause picking one. This is what condor_submit does today with OpSys and Arch. That behavior is hard-coded into the condor_submit executable and is difficult to replicate in configuration, but it is possible, though ugly, with the new job transforms. We can simulate it today with configuration that looks for the string OpSysMajorVer in the requirements and, if it is there, assumes that the job is correctly selecting a version. One can construct all kinds of expressions for which this isn't true, but it should work in practice, as it does for OpSys and Arch.

Job transform trickery

Using the existing job transforms, we can make the default "6" by configuring the schedd thus:

JOB_TRANSFORM_NAMES = EL, EL_VER
JOB_TRANSFORM_EL @= end

REQUIREMENTS Regexp("OpSysMajorVer",UnParse(Requirements),"i") =?= false && Regexp("\"WINDOWS\"", UnParse(Requirements)) =?= false
SET Requirements (Target.OpSysMajorVer == 6) && ($(MY.Requirements))
SET AmTransformed true

@end
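The decision this transform makes can be sketched outside HTCondor. The Python below is only an approximation for reasoning about the rule: the function name apply_el_default and its default_ver parameter are hypothetical, and it operates on a plain string standing in for UnParse(Requirements), not on a real ClassAd.

```python
import re

def apply_el_default(requirements, default_ver=6):
    """Approximate the JOB_TRANSFORM_EL rule on an unparsed requirements string.

    Hypothetical helper for illustration; not part of HTCondor.
    """
    # First Regexp in the transform: case-insensitive search for the attribute name.
    mentions_version = re.search(r"OpSysMajorVer", requirements, re.IGNORECASE) is not None
    # Second Regexp: case-sensitive search for the quoted string "WINDOWS".
    mentions_windows = re.search(r'"WINDOWS"', requirements) is not None

    if mentions_version or mentions_windows:
        # The job already picks a platform, so it is left untouched.
        return {"Requirements": requirements}

    # Otherwise pin the job to the default EL version. The original expression
    # is parenthesised here so that a top-level || cannot escape the new clause.
    return {
        "Requirements": "(Target.OpSysMajorVer == %d) && (%s)" % (default_ver, requirements),
        "AmTransformed": True,
    }
```

Switching the pool default from the old OS to the new one then amounts to changing the single version number in the SET line (default_ver in this sketch).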

We can even extract the requested version and insert it into a standalone attribute with configuration like this:

JOB_TRANSFORM_EL_VER @= end
[
    eval_set_WantELVer = isError(int(Regexps("OpSysMajorVer == ([0-9])", UnParse(Requirements), "\\\\1", "i"))) ? 0 : int(Regexps("OpSysMajorVer == ([0-9])", UnParse(Requirements), "\\\\1", "i"));
]
@end
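To make the behavior of this extraction easier to see, here is a rough Python equivalent. It is a sketch under assumptions: want_el_ver is a hypothetical name, the input string stands in for UnParse(Requirements), and the fallback to 0 mirrors the isError branch of the expression above.

```python
import re

# Same pattern as the transform: one digit, "==" only, case-insensitive.
_VERSION_CLAUSE = re.compile(r"OpSysMajorVer == ([0-9])", re.IGNORECASE)

def want_el_ver(requirements):
    """Return the EL version named in the requirements text, or 0 if none is found.

    Hypothetical helper for illustration; not part of HTCondor.
    """
    match = _VERSION_CLAUSE.search(requirements)
    # Only the first matching clause is considered.
    return int(match.group(1)) if match else 0
```

The sketch also exposes the limits of the pattern: a clause such as OpSysMajorVer >= 7 yields 0, and a job that accepts both 6 and 7 reports only whichever version appears first in the text.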

With this configuration, a job that is submitted with a requirements statement like

Requirements = OpSysMajorVer == 7

is not changed at all, and matches only machines that run EL7, but a job with no such clause has its requirements expression changed to look like

Requirements = (Target.OpSysMajorVer == 6) && ...
AmTransformed = true

Note that relying on a regexp match against the UnParse output is suboptimal, since it matches text rather than attribute references. We could implement, in 8.7.x, a ClassAd ContainsAttr function that returns true iff the expression references (directly or indirectly) some named attribute. This would make the configuration simpler and more robust.
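As a rough illustration of what such a function would need to do, the Python sketch below checks for attribute references rather than raw text. Everything in it is an assumption made for illustration: contains_attr and attrs_in are hypothetical names, expressions are plain strings, the job ad is a dict of attribute name to expression text, and the tokenizer is far simpler than a real ClassAd parser.

```python
import re

# String literals are matched first so that names inside quotes are skipped.
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_][A-Za-z0-9_]*')
_SCOPES = {"my", "target"}

def attrs_in(expr):
    """Return the lower-cased identifiers that appear outside string literals.

    Function names and keywords are also returned; that is harmless here
    unless an attribute shares their name.
    """
    names = set()
    for tok in _TOKEN.findall(expr):
        if tok.startswith('"') or tok.lower() in _SCOPES:
            continue
        names.add(tok.lower())
    return names

def contains_attr(expr, name, ad=None):
    """True if expr references name directly, or indirectly through attributes in ad."""
    wanted = name.lower()
    lookup = {k.lower(): v for k, v in (ad or {}).items()}
    seen = set()
    pending = [expr]
    while pending:
        for attr in attrs_in(pending.pop()):
            if attr == wanted:
                return True
            # Follow each referenced attribute once, which also guards against cycles.
            if attr in lookup and attr not in seen:
                seen.add(attr)
                pending.append(lookup[attr])
    return False
```

Unlike the regexp approach, this does not fire on a string literal that merely contains the text OpSysMajorVer, and it does fire when the version is selected through an intermediate attribute.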

These transforms keep the Requirements expression simple and obey the principle of least surprise for our users: when they request an OpSysMajorVer, no changes are made to their requirements.

Also note the EL_VER transform, which places the requested version into a simple attribute (WantELVer) that GlideinWMS can read. The same attribute could also be used in a Docker, VM, or Singularity environment to create a machine of a given type.

Schedds CHTC does not manage

This configuration has no effect on schedds that CHTC does not manage. We could either ask their administrators to implement it, which presumes an 8.6 HTCondor schedd, or implement policy on CHTC execute nodes to allow jobs from unmanaged schedds only on a certain distro.