HTCondorWiki: How To Upgrade Condor Gracefully


How to upgrade Condor gracefully

Known to work with Condor version: 7.0

One way to upgrade gracefully is to shut down the pool, install the new version of Condor, and then start it back up. To do that, see How to shut down condor without killing jobs. Before you do, however, consider the cost of waiting for jobs to finish. On a multi-core machine, leaving all cores but one idle while you wait for the last job to finish may be worse than killing everything and restarting quickly.

Another way to upgrade is to leave Condor running. Condor will automatically restart itself if the condor_master binary is updated. To take advantage of this, configure Condor so that the path to binaries (e.g. MASTER) points to the new binaries. One way to do that (under Unix) is to use a symlink that points to the current Condor installation directory (e.g. /opt/condor). Once the new files are in place, change the symlink to point to the new directory. If Condor is configured to locate its binaries via the symlink, then after the symlink changes, condor_master will notice the new binaries and restart itself. (How frequently it checks is controlled by MASTER_CHECK_NEW_EXEC_INTERVAL, which defaults to 5 minutes.)
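The symlink switch described above can be sketched in shell. This is a minimal illustration, not a tested upgrade procedure: the function name switch_condor_symlink and the directory names condor-old and condor-new are hypothetical, and the demo runs in a scratch directory rather than under /opt.

```shell
#!/bin/sh
# Repoint the "current installation" symlink at a new install directory.
# switch_condor_symlink is a hypothetical helper name, not a Condor tool.
switch_condor_symlink() {
    link_path=$1    # path the Condor configuration refers to
    new_target=$2   # directory holding the new binaries
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow the old link into the directory it points at.
    ln -sfn "$new_target" "$link_path"
}

# Demo in a scratch directory so that nothing real is touched.
demo_root=$(mktemp -d)
mkdir "$demo_root/condor-old" "$demo_root/condor-new"
ln -s "$demo_root/condor-old" "$demo_root/condor"
switch_condor_symlink "$demo_root/condor" "$demo_root/condor-new"
readlink "$demo_root/condor"    # prints the path of the condor-new directory
```

Note that ln -sfn removes the old link before creating the new one, so there is a brief moment when the link does not exist. If that matters, creating the new link under a temporary name and renaming it over the old one is a common alternative.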

When the master notices new binaries, it begins a graceful restart, which may not be exactly what you want. On an execute machine, a graceful restart means that running jobs are preempted. Standard universe jobs will attempt to checkpoint, which could be a problem if all machines in a large pool attempt to do this at the same time. If they do not complete within the cutoff time specified by the KILL policy expression (default 10 minutes), then they are killed without checkpointing. You may therefore want to increase this cutoff time and you may also want to upgrade the pool in stages rather than all at once.

In universes other than the standard universe, jobs are also preempted. If a job has been guaranteed a certain amount of uninterrupted run time with MaxJobRetirementTime, it is not killed until that retirement time has been exceeded (the default is 0). Killing the job starts with a soft kill signal, which the job can intercept in order to shut down gracefully, save state, and so on. If the job has not exited by the time the KILL expression fires (10 minutes by default), it is forcibly hard-killed. Because graceful shutdown of jobs may rely on shared resources, such as disks where state is saved, the same reasoning applies as for the standard universe: you may want to increase the KILL time for large pools, and you may want to upgrade the pool in stages so that jobs do not run out of time.
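Upgrading the pool in stages can be sketched as a loop over batches of hosts. Everything specific here is an assumption for illustration: the host names, the batch size, the pause, and especially upgrade_host, which is a hypothetical placeholder that only prints what it would do.

```shell
#!/bin/sh
# Sketch of a staged rollout. upgrade_host is a hypothetical placeholder.
upgrade_host() {
    # A real version might repoint the install symlink on host "$1".
    echo "would upgrade $1"
}

staged_upgrade() {
    batch_size=$1   # hosts per stage
    pause=$2        # seconds to wait between stages
    # Reads host names on stdin; xargs -n groups them into lines of
    # at most batch_size names, one line per stage.
    xargs -n "$batch_size" | while read -r batch; do
        echo "stage: $batch"
        for host in $batch; do
            upgrade_host "$host"
        done
        sleep "$pause"
    done
}

# Demo with five made-up host names, two per stage, no pause.
printf '%s\n' node1 node2 node3 node4 node5 | staged_upgrade 2 0
```

In practice the pause would probably need to be at least as long as the time jobs are given to checkpoint or shut down, so that one stage finishes using shared resources before the next begins.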

Another time limit to be aware of is the configuration setting SHUTDOWN_GRACEFUL_TIMEOUT, which defaults to 30 minutes. If the graceful restart is not completed within this time, a fast restart ensues, which causes jobs to be hard-killed.
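Before upgrading, it may help to check what these limits are actually set to on a machine. The sketch below uses condor_config_val to print each setting mentioned on this page. The function name show_restart_limits is hypothetical, and the script falls back to a placeholder message when the tool is missing or the setting is undefined.

```shell
#!/bin/sh
# Print the local value of each setting that bounds a graceful restart.
# show_restart_limits is a hypothetical helper name.
show_restart_limits() {
    for name in MASTER_CHECK_NEW_EXEC_INTERVAL KILL \
                MaxJobRetirementTime SHUTDOWN_GRACEFUL_TIMEOUT; do
        if command -v condor_config_val >/dev/null 2>&1 &&
           value=$(condor_config_val "$name" 2>/dev/null); then
            echo "$name = $value"
        else
            echo "$name = (not defined, or condor_config_val not available)"
        fi
    done
}

show_restart_limits
```

This reports the configuration of the machine it runs on, so on a pool with per-machine settings it would need to be run on each type of machine.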

How to check what version of Condor is running on all machines in a pool

On Unix, the following is a handy way to summarize the Condor versions that exist in your pool:

condor_status -master -format "%s\n" CondorVersion | sort | uniq -c
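During a staged upgrade it can also be useful to list which machines have not yet picked up the new version. The following is a sketch under stated assumptions: list_outdated is a hypothetical helper, and the machine names and simplified version strings in the demo are made up rather than real condor_status output.

```shell
#!/bin/sh
# Print the names of machines whose version string lacks the expected
# version. Reads lines of the form "<machine name> <version string>".
list_outdated() {
    expected=$1
    awk -v want="$expected" '{
        name = $1
        $1 = ""                      # leave only the version string in $0
        if (index($0, want) == 0) print name
    }'
}

# Demo with made-up machine names and simplified version strings.
printf '%s\n' \
    'node1.example.org 7.0.5' \
    'node2.example.org 7.0.1' \
    'node3.example.org 7.0.5' | list_outdated '7.0.5'
```

Against a live pool, the input could presumably come from a command such as condor_status -master -format "%s " Name -format "%s\n" CondorVersion, piped into list_outdated. The match is a plain substring test, so choose an expected string that is specific enough to be unambiguous.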