Migration of CHTC from EL 6 to EL7
Every few years, when CHTC decides to migrate execute machines from one OS to another, we try to go to great effort to lessen the burden on users. This page tries to document the EL6 to EL7 migration to memorialize these issues for subsequent transitions in CHTC and for other sites.Requirements and Assumptions
The transition must be gradual, and staged, so that users with old jobs that cannot run on the old OS can run to completion on a subset of the pool at the same time that users with jobs that can only run on the new OS can make progress.
If a user specifies nothing extraordinary, their jobs will run on the "default" OS, which initially is the old, and at some point in time is switched to the new.
During the transition, some jobs will only run on the old os, some jobs only on the new, and some (many?) on both, and the user needs to be able to opt into any of these three cases.
This transition applies to
- Jobs submitted from CHTC-managed schedds, running on CHTC-managed execute nodes
- Jobs submitted from non-CHTC managed schedds, running on CHTC-managed execute nodes (jobs flocking in)
- Jobs submitted from CHTC-managed schedds, running on non-CHTC-managed execute nodes (jobs flocking out, and glideins)
The HTCondor religion is that execute nodes advertise what they are, and it is the responsibility of the jobs to select what the need. If a job selects the wrong OS to run on, that's on the job, the startd should not care. However, pragmatically, we may need to implement policy on the execute node in the case where it is the only place that CHTC has control.
The problem with defaulting
Ideally, we'd just ask the users to put a clause in their requirements expression, selecting a particular platform, e.g.
requirements = OpSysMajor == 7 or requirements = ((OpSysMajor == 7) || (OpSysMajor == 6))
However, there are a couple of problems with that. The first problem is defaulting. When a has requirements that don't mention OpSysMajor, the submit side needs to add in a clause picking one. This is what condor_submit does today with OpSys and Arch. This is hard-coded into the condor_submit executable, and is difficult to replicate in configuration, but it is possible, though ugly with the new submit transforms. We can simulate this today in config which looks for the string OpSysMajor in requirements, and if it is there, we can assume that the job is correctly selecting a version. One can construct all kinds of expressions for which this isn't true, but it should work in practice, as it does for OpSys and Arch.
Submit Requirements trickery
Using existing submit requirements, we can make the default "6" by configuring the schedd thus:
JOB_TRANSFORM_NAMES = EL, EL_VER JOB_TRANSFORM_EL @= end REQUIREMENTS Regexp("OpSysMajorVer",UnParse(Requirements),"i") =?= false && Regexp("\"WINDOWS\"", UnParse(Requirements)) =?= false SET Requirements (Target.OpSysMajorVer == 6) && $(MY.Requirements) SET AmTransformed true @end
We can even extract the version requested, and insert it into a standalone attribute by configuration like this:
JOB_TRANSFORM_EL_VER @= end [ eval_set_WantELVer = isError(int(Regexps("OpSysMajorVer == ([0-9])", UnParse(Requirements), "\\\\1", "i"))) ? 0 : int(Regexps("OpSysMajorVer == ([0-9])", UnParse(Requirements), "\\\\1", "i")); ] @end
With this configuration, a job that is submitted with a requirements statement like
Requirements = OpSysMajorVer == 7
is not changed at all, and only matches to a machine that runs EL7, but a job with no such clause has the requirements expression changed to look like
Requirements = (Target.OpSysMajorVer == 6) && ... AmTransformed = true
Note that relying on a regexp match of an UnParse is suboptimal. We can then implement in 8.7.x a classad ContainsAttr function that returns true iff the expression references (directly or indirectly) some named attribute. This would make the code more robust and simpler.
These expressions simplify the Requirements expression, and obey the principle of least surprise for our users -- when they request a OpSysMajorVer, no changes are made to their requirements.
Also note the EL_VER, which can be used by GlideinWMS to extract into a simple attribute which version is being requested. This could also be used in a docker, VM, or singularity environment to create a machine of a given type.
Schedds CHTC does not manage
This does not impact schedds that schedd does not manage. We could either ask them to implement this configuration, which presumes an 8.6 condor schedd. Or, we could implement policy on chtc execute nodes to only allow jobs from unmanaged schedds on a certain distro.