 This HOWTO describes how to use the features of HTCondor to run self-checkpointing jobs.  This HOWTO is not about how to checkpoint a given job; that's up to you or your software provider.
 
-This HOWTO focuses on using =+WantFTOnCheckpoint=, a submit command which causes HTCondor to do file transfer when the job checkpoints.  This feature makes a number of assumptions about the job, detailed in the 'Assumptions' section.  These assumptions may not be true for your job, but in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to make them true.  If you can not, consult the 'Working Around the Assumptions' and/or 'Other Options' sections.
+This HOWTO focuses on using =+WantFTOnCheckpoint=, a submit command which causes HTCondor to do file transfer when the job checkpoints.  This feature makes a number of assumptions about the job, detailed in the 'Assumptions' section.  These assumptions may not be true for your job, but in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to make them true.  If you can not, consult the _Working Around the Assumptions_ and/or _Other Options_ sections.
 
 {subsection: General Idea}
 
-The general idea of checkpointing is, of course, for your job to save its progress in such a way that it can resume its forward progress after being interrupted.  Interruptions in the HTCondor system generally take three different forms: temporary (eviction and preemption), permanent (=condor_rm=), and recoverable failures (network outages, software faults, hardware failures, being held).  Of course, no checkpoint system can recover from permanent failures, but =+WantFTOnCheckpoint= -- unlike much of the 'Other Options' section -- readily allows recovery from all other interruptions.
+The general idea of checkpointing is, of course, for your job to save its progress in such a way that it can resume its forward progress after being interrupted.  Interruptions in the HTCondor system generally take three different forms: temporary (eviction and preemption), permanent (=condor_rm=), and recoverable failures (network outages, software faults, hardware problems, being held).  Of course, no checkpoint system can recover from permanent failures, but =+WantFTOnCheckpoint= -- unlike many choices in the _Other Options_ section -- readily allows recovery from all other interruptions.
 
 The scenario runs as follows:
 
@@ -23,8 +23,8 @@
 *::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has transferred a checkpoint.
 3:  Starting your job up from a checkpoint is relatively quick.
 *::  If starting your job up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
-4:  Your job can not otherwise communicate with HTCondor, or it does not atomically update its checkpoint(s).
-*::  Because interruptions may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint.
+4:  Your job atomically updates its checkpoint file(s).
+*::  Because eviction/preemption may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint at that time.
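+
+To make these assumptions concrete, below is a minimal sketch (in Python) of a program that satisfies them.  The checkpoint file name (=my.checkpoint=), the exit code (77), and the amount of work between checkpoints are invented for this example; only the overall shape matters.
+
+{code}
+#!/usr/bin/env python3
+# Sketch of a self-checkpointing job; names and numbers are invented.
+import os, sys
+
+CHECKPOINT = 'my.checkpoint'
+TOTAL_STEPS = 1000
+STEPS_PER_CHECKPOINT = 100
+
+# Always look for a checkpoint on start-up; HTCondor restarts the job
+# with the same command line after transferring the checkpoint.
+step = 0
+if os.path.exists(CHECKPOINT):
+    with open(CHECKPOINT) as f:
+        step = int(f.read())
+
+while step < TOTAL_STEPS:
+    step += 1                          # stand-in for real work
+    if step % STEPS_PER_CHECKPOINT == 0:
+        # Update the checkpoint atomically: write a temporary file and
+        # rename it over the old checkpoint.
+        with open(CHECKPOINT + '.tmp', 'w') as f:
+            f.write(str(step))
+        os.replace(CHECKPOINT + '.tmp', CHECKPOINT)
+        # Exit with a unique code after a successful checkpoint, so
+        # HTCondor transfers the checkpoint and restarts the job.
+        sys.exit(77)
+
+sys.exit(0)                            # all work done
+{endcode}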
 
 {subsection: Using +WantFTOnCheckpoint}
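+
+A submit file for the sketch above might look roughly like the following.  The checkpoint file name and exit code are carried over from that sketch; the executable name =my_program.py= is likewise invented, and the unique exit code is registered with =+SuccessCheckpointExitCode=.
+
+{code}
+# Illustrative submit description; names and values are examples only.
+executable              = my_program.py
+log                     = my_program.log
+output                  = my_program.out
+error                   = my_program.err
+
+should_transfer_files   = YES
+when_to_transfer_output = ON_EXIT_OR_EVICT
+# List the checkpoint file so that it is transferred with the output.
+transfer_output_files   = my.checkpoint
+
+# Transfer files when the job checkpoints, and treat exit code 77 as
+# "a checkpoint was just taken successfully".
++WantFTOnCheckpoint        = TRUE
++SuccessCheckpointExitCode = 77
+
+queue
+{endcode}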
 
@@ -36,8 +36,8 @@
 
 *:  How frequently your job "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will take a sixth longer, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  (Just because a job which writes checkpoints frequently doesn't mean it has to exit that often, but the details of writing or altering a script to do that are beyond the scope of this HOWTO.)
 *:  How frequently your job is interrupted.  For instance, if the max job lifetime on your pool is six hours, taking a checkpoint every three hours could easily result in losing almost (but not quite actually) three hours of work.
-*:  How long it takes to take and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?), but the second, in practice, can only be done experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Unfortunately, you generally want to checkpoint less frequently (if possible) when checkpoints take longer, so as to maintain efficient job progress, the longer between intervals, the more progress the job will lose when interrupted.  The timing of interruptions isn't (in general) predictable, and (in practice) varies from pool to pool, and can also only be predicated assuming its similarity to past experiments.  The big exception to this is above: max job runtimes.
-
+*:  How long it takes to take and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?); the second, in practice, can only be measured experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Unfortunately, you generally want to checkpoint less frequently (if possible) when checkpoints take longer, so as to maintain efficient job progress.  However, the longer the interval between checkpoints, the more progress the job will lose when interrupted.  The timing of interruptions isn't (in general) predictable, and (in practice) varies from pool to pool; it can only be predicted by assuming it will resemble past experience.  The big exception to this is noted above: max job runtimes.
+*:  Your appetite for deadline risk versus your desire for fast turn-around.  Generally, the longer you go between checkpoints, the sooner the job will complete (because taking checkpoints and transferring them takes time).  On the other hand, if you are interrupted, you'll lose more progress.  (A back-of-the-envelope example follows this list.)
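+
+As a back-of-the-envelope illustration of these trade-offs, the following sketch (with invented numbers) estimates the runtime overhead added by checkpointing and the work you might expect to lose to a single interruption:
+
+{code}
+# Rough estimate with invented numbers; measure your own job instead.
+compute_between_checkpoints = 50.0   # minutes of useful work per cycle
+checkpoint_and_transfer     = 10.0   # minutes to write and transfer one checkpoint
+
+overhead = checkpoint_and_transfer / compute_between_checkpoints
+print(f"runtime overhead: {overhead:.0%}")                              # 20%
+
+# If an interruption is equally likely at any moment, it arrives on
+# average halfway through a cycle, costing about half a cycle of work.
+expected_loss = (compute_between_checkpoints + checkpoint_and_transfer) / 2
+print(f"expected loss per interruption: {expected_loss:.0f} minutes")   # 30 minutes
+{endcode}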
 
 {subsection: Debugging Checkpoints}
 
@@ -45,10 +45,12 @@
 
 {subsection: Working Around the Assumptions}
 
-1:  If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script for the job may be able to inspect the sandbox after the job exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the job can return literally any value at all, HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your job can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job itself.
+1:  If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script for the job may be able to inspect the sandbox after the job exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the job can return literally any value at all, HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your job can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job.
 2:  If your job requires different arguments to start from a checkpoint, you can wrap your job in a script which checks for the presence of a checkpoint and runs the job with the corresponding arguments.
 3:  If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.  See also the _Delayed and Manual Transfers_ section.
-4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  Future version of HTCondor may remove the requirement for job to set =when_to_transfer_output= or =ON_EXIT_OR_EVICT=.  Doing so would relax this requirement; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file succesfully transfers back, or none of them do.)
+4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.
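+
+For example, a wrapper script implementing the rename trick just described might look like the following sketch (in Python); the file names and exit code 17 come from the example above, while the program name =my_job= is invented.
+
+{code}
+#!/usr/bin/env python3
+# Hypothetical wrapper: restore the last complete checkpoint, run the
+# real job, and only "publish" a new checkpoint if the job reports
+# (via exit code 17) that it finished writing one.
+import os, shutil, subprocess, sys
+
+if os.path.exists('checkpoint.atomic'):
+    # Give the job the last complete checkpoint to resume from.
+    shutil.copy('checkpoint.atomic', 'checkpoint')
+
+rv = subprocess.run(['./my_job'] + sys.argv[1:]).returncode
+
+if rv == 17:
+    # The job finished writing 'checkpoint'; rename is atomic, so
+    # 'checkpoint.atomic' is never left half-written.
+    os.replace('checkpoint', 'checkpoint.atomic')
+
+# Pass the job's exit code through unchanged, so that 17 still marks
+# a successful checkpoint for HTCondor.
+sys.exit(rv)
+{endcode}
+
+With a wrapper like this, =transfer_output_files= would list 'checkpoint.atomic' rather than 'checkpoint', so that only complete checkpoints are ever transferred.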
+
+Future versions of HTCondor may remove the requirement for the job to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax the atomic-update requirement; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file successfully transfers back, or none of them do.)
 
 Future versions of HTCondor may provide for explicit coordination between the job and HTCondor.  Modifying a job to explicitly coordinate with HTCondor would substantially alter the assumptions described above.
 
@@ -64,7 +66,7 @@
 
 {subsubsection: Signals}
 
-If your job can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE=, and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual.  This method should be as reliable as spona
+If your job can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE=, and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at an interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual.
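+
+For instance, a job which takes a checkpoint when sent SIGUSR2 and then kills itself with that same signal might add something like the following to its submit file (the choice of signal, and writing it by name rather than by number, are assumptions of this sketch):
+
+{code}
+# Illustrative signal-driven checkpointing settings.
++WantCheckpointSignal          = TRUE
++CheckpointSig                 = "SIGUSR2"
+
+# The job exits by the same signal after a successful checkpoint.
++SuccessCheckpointExitBySignal = TRUE
++SuccessCheckpointExitSignal   = "SIGUSR2"
+{endcode}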
 
 {subsubsection: Reactive Checkpoints}