{subsection: General Idea}
 
-The general idea of checkpointing is, of course, for your job to save its progress in such a way that it can resume its forward progress after being interrupted.  Interruptions in the HTCondor system generally take three different forms: temporary (eviction and preemption), permanent (=condor_rm=), and recoverable failures (network outages, software faults, hardware problems, being held).  Of course, no checkpoint system can recover from permanent failures, but =+WantFTOnCheckpoint= -- unlike many choices in the _Other Options_ section -- readily allows recovery from all other interruptions.
+The general idea of checkpointing is, of course, for your job to save its progress in such a way that it can pick up where it left off after being interrupted.  Interruptions in the HTCondor system generally take three forms: temporary (eviction and preemption), permanent (=condor_rm=), and recoverable failures (network outages, software faults, hardware problems, being held).  No checkpoint system can recover from permanent interruptions, but =+WantFTOnCheckpoint= -- unlike many choices in the _Other Options_ section -- readily allows recovery from the other two types of interruption.
 
-The scenario runs as follows:
+File transfer on checkpoint works in the following way:
 
-1:  Your job exits after taking a checkpoint with a unique exit code.
-2:  HTCondor recognizes the unique exit code and does file transfer; this file transfer presently behaves as it would if the job were being evicted/preempted.  This implies that =when_to_transfer_output= is set to =ON_EXIT_OR_EVICT=.
-3:  After the job's =transfer_output_files= are successfully sent to the submit node (and are stored in the schedd's spool directory, as normal for file transfer on eviction/preempt), HTCondor restarts the job exactly as it started it the first time.
-4:  After something interrupts the job, HTCondor reschedules it as normal.  As usual, HTCondor will start the job exactly as it started it the first time, but instead of starting with a fresh copy of your =transfer_input_files=, the sandbox will instead be copied from the =transfer_output_files= stored on the submit node in the previous step.
+1:  The job takes a checkpoint and then exits with a unique exit code.
+2:  HTCondor recognizes the unique exit code and does file transfer; this file transfer presently behaves as it would if the job were being evicted.  This requires that =when_to_transfer_output= be set to =ON_EXIT_OR_EVICT=.
+3:  After the job's =transfer_output_files= are successfully sent to the submit node (and are stored in the schedd's spool, as normal for file transfer on eviction), HTCondor restarts the job exactly as it started it the first time.
+4:  If something interrupts the job, HTCondor reschedules it as normal.  As for any job with =when_to_transfer_output= set to =ON_EXIT_OR_EVICT=, HTCondor will restart the job with the files stored in the schedd's spool.
+
+By storing the job's checkpoint on the submit node as soon as it is created, this method allows the job to resume even if the execute node becomes unavailable.  Transferring the checkpoint before the job is interrupted is also more reliable, not just because not all interruptions permit file transfer, but because job lifetimes are typically much longer than eviction deadlines, so slower transfers are much more likely to complete.
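+
+What this looks like from the job's side is sketched below.  This is only an illustration, not HTCondor-provided code: the checkpoint file name (=checkpoint.json=), the checkpoint exit code (77), and the step counts are assumptions made for the example, and must match whatever your real job and submit file agree on.
+
+{verbatim}
+#!/usr/bin/env python3
+# Illustrative sketch of a self-checkpointing job.  All names and
+# numbers here (checkpoint.json, exit code 77, step counts) are
+# assumptions for the example, not anything HTCondor requires.
+
+import json
+import os
+import sys
+
+CHECKPOINT_FILE = "checkpoint.json"
+CHECKPOINT_EXIT_CODE = 77   # the "unique" exit code from step 1
+STEPS_PER_CHECKPOINT = 10
+TOTAL_STEPS = 1000
+
+def load_checkpoint():
+    # On start-up, look for a checkpoint and resume from it if present.
+    try:
+        with open(CHECKPOINT_FILE) as f:
+            return json.load(f)["next_step"]
+    except FileNotFoundError:
+        return 0
+
+def save_checkpoint(next_step):
+    # Update the checkpoint atomically: write a temporary file, flush
+    # it to disk, then rename it over the previous checkpoint.
+    tmp = CHECKPOINT_FILE + ".tmp"
+    with open(tmp, "w") as f:
+        json.dump({"next_step": next_step}, f)
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, CHECKPOINT_FILE)
+
+def do_step(step):
+    pass  # stand-in for the real work
+
+def main():
+    step = load_checkpoint()
+    while step < TOTAL_STEPS:
+        do_step(step)
+        step += 1
+        if step % STEPS_PER_CHECKPOINT == 0 and step < TOTAL_STEPS:
+            save_checkpoint(step)
+            # Exit with the unique code so HTCondor transfers the
+            # checkpoint and then restarts the job.
+            sys.exit(CHECKPOINT_EXIT_CODE)
+    sys.exit(0)   # normal completion uses a different code
+
+if __name__ == "__main__":
+    main()
+{endverbatim}
+
+For a job like this, =transfer_output_files= would need to include =checkpoint.json=, so that the checkpoint is among the files stored in the schedd's spool and restored when the job restarts.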
 
 {subsection: Assumptions}
 
 1:  Your job takes a checkpoint and then exits with an exit code it does not otherwise use.
 *::  If your job does not exit when it takes a checkpoint, HTCondor cannot (currently) transfer its checkpoint.  If your job does not exit with a unique code when it takes a checkpoint, HTCondor will transfer files and restart the job whenever the job exits with that code; if the checkpoint code and the terminal exit code are the same, your job will never finish.
 2:  When restarted, your job determines on its own if a checkpoint is available, and if so, uses it.
-*::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has transferred a checkpoint.
+*::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has taken a checkpoint.
 3:  Starting your job up from a checkpoint is relatively quick.
 *::  If starting your job up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
-4:  Your job atomically update its checkpoint file(s).
-*::  Because eviction/preemption may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint at that time.
+4:  Your job atomically updates its checkpoint file(s).
+*::  Because eviction may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint when your job is evicted.
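+
+The last assumption is the easiest to overlook.  If your checkpoint consists of more than one file, one way to make the update atomic is to bundle the files into a single archive and rename it into place, since a rename within a single filesystem is atomic on POSIX systems.  A minimal sketch, using hypothetical file names, is below.
+
+{verbatim}
+# Illustrative sketch of an atomic multi-file checkpoint update.
+import os
+import tarfile
+
+CHECKPOINT = "checkpoint.tar"   # list this in transfer_output_files
+
+def save_checkpoint(paths):
+    # Build the archive under a temporary name first...
+    tmp = CHECKPOINT + ".tmp"
+    with tarfile.open(tmp, "w") as tar:
+        for path in paths:
+            tar.add(path)
+    # ...then atomically rename it over the previous checkpoint, so an
+    # eviction-triggered transfer never sees a half-written archive.
+    os.replace(tmp, CHECKPOINT)
+
+# Example usage with stand-in state files.
+for name in ("state.dat", "rng_state.bin"):
+    with open(name, "wb") as f:
+        f.write(b"example state")
+save_checkpoint(["state.dat", "rng_state.bin"])
+{endverbatim}
+
+However the checkpoint is produced, the important property is that, at every instant, the file(s) named in =transfer_output_files= are either the complete old checkpoint or the complete new one.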
 
 {subsection: Using +WantFTOnCheckpoint}
 
@@ -79,10 +81,10 @@
 
 Ballpark, your job should aim to checkpoint once an hour.  However, this depends on a number of factors:
 
-*:  How frequently your job "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will take a sixth longer, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  (Just because a job which writes checkpoints frequently doesn't mean it has to exit that often, but the details of writing or altering a script to do that are beyond the scope of this HOWTO.)
+*:  How frequently your job "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if writing and transferring actually take that long, checkpointing will consume a sixth of the job's total run time, which may be too much overhead; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  If your job writes checkpoints very frequently, you may want to exit only on every tenth (or whatever) checkpoint.
 *:  How frequently your job is interrupted.  For instance, if the max job lifetime on your pool is six hours, taking a checkpoint every three hours could easily result in losing almost (but not quite) three hours of work.
-*:  How long it takes to take and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?), but the second, in practice, can only be done experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Unfortunately, you generally want to checkpoint less frequently (if possible) when checkpoints take longer, so as to maintain efficient job progress.  However, the longer between intervals, the more progress the job will lose when interrupted.  The timing of interruptions isn't (in general) predictable, and (in practice) varies from pool to pool, and can also only be predicated assuming its similarity to past experiments.  The big exception to this is above: max job runtimes.
-*:  Your appetite for deadline risk vs your desire for fast turn-arounds.  Generally, the longer you go between checkpoints, the sooner the job will complete (because taking checkpoints and transferring them takes time).  On the other hand, if you are interrupted, you'll lose more progress.
+*:  How long it takes to make and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?), but the second, in practice, can only be measured experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Generally, when checkpoints take longer to make and/or transfer, you want to checkpoint less frequently, so as to maintain efficient job progress.  However, the longer the interval between checkpoints, the more progress the job will lose when it is interrupted.
+*:  Your appetite for deadline risk versus your desire for fast turnaround.  Generally, the longer you go between checkpoints, the sooner the job will complete (because taking checkpoints and transferring them takes time).  On the other hand, if the job is interrupted, you'll wait longer because the job will have to recompute more.  The sketch after this list shows one rough way to put numbers on this trade-off.
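+
+The following back-of-the-envelope calculation is one rough way to compare checkpoint intervals.  It is only an illustration: the model is crude (it assumes interruptions land, on average, halfway through an interval, and ignores second-order effects), and all of the input numbers are made up, so substitute measurements from your own job log and pool.
+
+{verbatim}
+# Rough comparison of checkpoint intervals (illustrative numbers only).
+checkpoint_cost_min = 10              # time to write + transfer one checkpoint
+compute_time_hr = 100                 # useful compute the whole job needs
+mean_time_between_interruptions_hr = 3
+
+for interval_hr in (0.5, 1, 2, 4):
+    # Fraction of extra time spent writing and transferring checkpoints.
+    overhead = (checkpoint_cost_min / 60) / interval_hr
+    # An interruption lands, on average, halfway through an interval,
+    # so roughly half an interval of work must be redone each time.
+    lost_per_interruption_hr = interval_hr / 2
+    interruptions = compute_time_hr / mean_time_between_interruptions_hr
+    total_hr = (compute_time_hr * (1 + overhead)
+                + interruptions * lost_per_interruption_hr)
+    print(f"checkpoint every {interval_hr:4.1f} h: "
+          f"overhead {overhead:5.1%}, expected total ~{total_hr:.0f} h")
+{endverbatim}
+
+With these particular made-up numbers the hourly interval comes out best, which matches the ballpark above; with different checkpoint costs or interruption rates the best interval will move.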
 
 {subsection: Debugging Self-Checkpointing Jobs}