How To Run Self-Checkpointing Jobs

This HOWTO describes how to use the features of HTCondor to run self-checkpointing jobs. It is not about how to checkpoint a given job; that's up to you or your software provider.

This HOWTO makes a number of assumptions about the job. These assumptions may not be true for your job; in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to match. If you can not, consult the 'Meeting the Expectations' and/or 'Other Options' sections, below.

Expectations

  1. Your job can [be configured to] exit after taking a checkpoint with an exit code it does not otherwise use.
  2. When restarted, your job determines on its own if a checkpoint is available, and if so, uses it.
  3. Starting your job up from a checkpoint is relatively quick.
  4. Your job can not otherwise communicate with HTCondor, and it updates its checkpoint(s) atomically.
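
For concreteness, here is a shell sketch of a job that meets expectations 1, 2, and 4; the checkpoint file name 'my.checkpoint' and the exit code 77 are made up, and a real job would use whatever its application already provides.

    #!/bin/bash
    # Sketch only: a self-checkpointing job that counts to one million,
    # checkpointing every hundred thousand iterations.

    # (2) When restarted, determine whether a checkpoint is available,
    #     and if so, use it.
    if [ -r my.checkpoint ]; then
        i=$(cat my.checkpoint)
    else
        i=0
    fi

    while [ "$i" -lt 1000000 ]; do
        i=$(( i + 1 ))                      # the "real work"
        if [ $(( i % 100000 )) -eq 0 ]; then
            # (4) Update the checkpoint atomically: write a temporary
            #     file, then rename it into place.
            echo "$i" > my.checkpoint.tmp
            mv my.checkpoint.tmp my.checkpoint
            # (1) Exit with an exit code the job does not otherwise use.
            exit 77
        fi
    done

    # All of the work is done; exit normally so HTCondor knows the job
    # is finished.
    exit 0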

Using +WantFTOnCheckpoint

The general flow of operations is as follows. You set +WantFTOnCheckpoint to TRUE in your submit file and tell HTCondor, via the SuccessCheckpoint job attributes described below, how your job indicates a successful checkpoint. The job runs and periodically writes a checkpoint; each time it finishes writing one, it exits in that unique way. HTCondor then transfers the job's output files (including the checkpoint) back to the submit node and restarts the job, which resumes from the checkpoint it just wrote. If the job is later interrupted, HTCondor reschedules it and delivers the most recently saved checkpoint to the new sandbox, so the job resumes from that checkpoint rather than starting over.

How often your job checkpoints is a trade-off. Every checkpoint costs the time needed to write it, transfer it back to the submit node, and restart the job from it, so checkpointing very frequently slows the job's overall progress and consumes resources on the submit node. Checkpointing infrequently, on the other hand, increases the amount of work lost when the job is interrupted. Choose an interval long enough that the checkpoint overhead is small compared to the work done between checkpoints, but short enough that losing one interval's worth of work is acceptable.

Extended Example

The submit description file below builds on the job sketched under 'Expectations', above: a job which atomically updates the file 'my.checkpoint' and exits with the otherwise-unused code 77 after each checkpoint.
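
This is a minimal sketch rather than a tested example; the executable and file names, exit code 77, the +SuccessCheckpointExitCode attribute, and the ON_EXIT_OR_EVICT setting are all assumptions to check against your pool's documentation.

    # Sketch only; see the caveats above.
    executable              = my_self_checkpointing_job.sh

    # Transfer the checkpoint back to the submit node when the job
    # exits, and also if it is evicted, so the most recent checkpoint
    # is saved even when the job is interrupted.
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_output_files   = my.checkpoint

    # Treat an exit with code 77 as a successful checkpoint: transfer
    # files and restart the job rather than considering it finished.
    +WantFTOnCheckpoint        = TRUE
    +SuccessCheckpointExitCode = 77

    output = example.out
    error  = example.err
    log    = example.log

    queue 1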

Debugging Checkpoints

The simplest way to debug checkpointing is to test it: force your job off of its execute machine after it has had time to write at least one checkpoint, and then check the job's output (or its user log) to verify that, when it started running again, it resumed from the checkpoint rather than starting over from the beginning.
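
A sketch of that procedure, assuming the submit file above was saved as 'example.submit' and that the job was assigned the (placeholder) ID 123.0:

    # Submit the job and note its ID.
    condor_submit example.submit

    # After the job has had time to write at least one checkpoint,
    # force it off of its execute machine.
    condor_vacate_job 123.0

    # Once the job starts running again, check its status, output, and
    # user log to confirm that it resumed from the checkpoint rather
    # than from the beginning.
    condor_q 123.0
    tail example.log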

Meeting the Expectations

  1. If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script may be able to inspect the sandbox after the job exits, determine whether a checkpoint was successfully created, and if so itself exit with a unique value. If the job may exit with literally any code (so that no exit code is left unused), HTCondor can instead regard being killed by a particular (unique) signal as the sign of a successful checkpoint; set +SuccessCheckpointExitBySignal to TRUE and +SuccessCheckpointExitSignal to the particular signal. If your job can not be made to exit after taking a checkpoint at all, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job itself.
  2. If your job requires different arguments to start from a checkpoint than to start from scratch, you can wrap your job in a script which checks for the presence of a checkpoint and runs the job with the corresponding arguments (see the sketch following this list).
  3. Because your job restarts from its most recent checkpoint after every checkpoint exit (and after every interruption), a slow restart slows the job's overall progress. If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress. This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.
  4. If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints. More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'. Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint. Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time. A sketch of such a wrapper script follows this list.
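
The following is a sketch of a wrapper script implementing items 2 and 4; the program name, its '--resume-from' flag, and exit code 17 are hypothetical.

    #!/bin/bash
    # Sketch only: a wrapper for a job whose checkpoints are not atomic.

    # Item 4 (restart half): put the last known-good checkpoint back
    # where the job expects to find it, then start the job from it
    # (item 2: different arguments when resuming).
    if [ -r checkpoint.atomic ]; then
        cp checkpoint.atomic checkpoint
        ./my_program --resume-from checkpoint
    else
        ./my_program
    fi
    exit_code=$?

    # Item 4 (exit half): the job exits with code 17 only after it has
    # finished writing 'checkpoint', so only then is the checkpoint
    # atomically published for HTCondor to transfer.
    if [ "$exit_code" -eq 17 ]; then
        mv checkpoint checkpoint.atomic
    fi

    exit "$exit_code"

With such a wrapper, the submit file would list 'checkpoint.atomic' (rather than 'checkpoint') among its output files and would treat exit code 17 as the successful-checkpoint code.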

Future versions of HTCondor may provide for explicit coordination between the job and HTCondor. Modifying a job to explicitly coordinate with HTCondor would substantially alter the expectations.

Other Options

The other sections of this HOWTO explain how a job meeting this HOWTO's assumptions can take checkpoints at arbitrary intervals and transfer them back to the submit node. Although this is the method of operation most likely to result in an interrupted job continuing from a valid checkpoint, other, less reliable options exist.

Signals

If your job can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and then exit in a unique way, you may set +WantCheckpointSignal to TRUE and +CheckpointSig to the particular signal. HTCondor will send this signal to the job at an interval set by the administrator of the execute machine; if the job exits as specified by the SuccessCheckpoint job attributes, its files will be transferred and the job restarted, as usual. This method should be as reliable as exiting spontaneously after taking a checkpoint.
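
For example, assuming the job takes a checkpoint and then exits when sent SIGUSR2, the relevant submit file lines might look like the following; whether these attributes take signal names (as shown) or signal numbers is an assumption to check against your pool's documentation.

    # Sketch only; see the caveat above about signal value syntax.
    +WantCheckpointSignal          = TRUE
    +CheckpointSig                 = "SIGUSR2"

    # The job exits on SIGUSR2 after completing a checkpoint.
    +SuccessCheckpointExitBySignal = TRUE
    +SuccessCheckpointExitSignal   = "SIGUSR2"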

Reactive Checkpoints

Instead of taking a checkpoint at some interval, it is possible, for some specific interruptions, to instead take a checkpoint when interrupted. Specifically, if your execution resources are generally reliable, and your job's checkpoints are both quick to take and small, your job may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is preempted. This works like the previous section, except that you set KillSig to the particular signal, and the signal is only sent when your job is preempted. The administrator of the execute machine determines the maximum amount of time a job is allowed to run after receiving its KillSig; a job may request a longer delay than the machine's default by setting JobMaxVacateTime (but this will be capped by the administrator's setting).
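
As a sketch, assuming the job writes its final checkpoint when sent SIGUSR2, and assuming the kill_sig and job_max_vacate_time submit commands set the KillSig and JobMaxVacateTime attributes named above:

    # Sketch only: send SIGUSR2 instead of the default SIGTERM when
    # the job is preempted, and request up to five minutes to finish
    # writing and transferring the checkpoint.
    kill_sig                = SIGUSR2
    job_max_vacate_time     = 300

    # Transfer the sandbox (including the checkpoint) back to the
    # submit node on eviction as well as on normal exit.
    when_to_transfer_output = ON_EXIT_OR_EVICT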

You should probably only use this method of operation if your job runs on an HTCondor pool too old to support +WantFTOnCheckpoint, or if the pool administrator has disallowed use of that feature (because it can be resource-intensive).