-{section: How To Run Self-Checkpointing Jobs}
-
-This HOWTO describes how to use the features of HTCondor to run self-checkpointing jobs.  This HOWTO is not about to checkpoint a given job; that's up to you or your software provider.
-
-This HOWTO focuses on using =+WantFTOnCheckpoint=, a submit command which causes HTCondor to do file transfer when a job indicates that it has taken a checkpoint by exiting in a specific way.  While there are other approaches, we recommend this one because it allows jobs to checkpoint and resume across as many different types of interruptions as possible, and because among the techniques that make the same claim, it's the simplest.
+This HOWTO is about how to write jobs for an executable which periodically saves checkpoint information, and how to make HTCondor store that information safely, in case it's needed to restart the job.
 
-The =+WantFTOnCheckpoint= submit command makes a number of assumptions about the job, detailed in the 'Assumptions' section.  These assumptions may not be true for your job, but in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to make them true.  If you can not, consult the _Working Around the Assumptions_ section and/or the _Other Options_ section.
+This HOWTO is _not_ about how to checkpoint a given executable; that's up to you or your software provider.
 
-{subsection: General Idea}
-
-The general idea of checkpointing is, of course, for your job to save its progress in such a way that it can pick up where it left off after being interrupted.  Interruptions in the HTCondor system generally take three different forms: temporary (eviction and preemption), permanent (=condor_rm=), and recoverable failures (network outages, software faults, hardware problems, being held).  Of course, no checkpoint system can recover from permanent failures, but =+WantFTOnCheckpoint= -- unlike many choices in the _Other Options_ section -- readily allows recovery from the other types of interruptions.
-
-File transfer on checkpoint works in the following way:
+{section: How To Run Self-Checkpointing Jobs}
 
-1:  The job exits after taking a checkpoint with a unique exit code.  For example, the following Python script exits with code 77.
-{file: 77.py}
-import sys
-print("Now exiting with code 77.")
-sys.exit(77)
-{endfile}
-2:  HTCondor recognizes the unique exit code and does file transfer; this file transfer presently behaves as it would if the job were being evicted.  Thus, your job must set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.
-3:  After the job's =transfer_output_files= are successfully sent to the submit node (and are stored in the schedd's spool, as normal for file transfer on eviction), HTCondor restarts the job on the same machine in exactly the same way as it started it the first time.  The job does not become idle -- it stays in the running state while the checkpoint is transferred.
-4:  If something interrupts the job, HTCondor reschedules it as normal.  As for any job with =ON_EXIT_OR_EVICT= set, HTCondor will restart the job with the files stored in the schedd's spool.
+The best way to run self-checkpointing code is with the submit command =+WantFTOnCheckpoint=.  This causes HTCondor to transfer checkpoint files to the submit node when the executable indicates that it has finished writing them.  After file transfer is complete, HTCondor will restart the executable on the same machine and in the same sandbox.  This immediate transfer means that the checkpoint is available for restarting the job even if the job is interrupted in a way that doesn't allow for files to be transferred, or if file transfer doesn't complete in the time allowed.
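+
+To make this concrete, here is a minimal sketch of the checkpoint-related submit commands, using the file names from the =77-ckpt.py= example in the 'Using +WantFTOnCheckpoint' section below.  This is a sketch, not the full submit file from that section, and the =+SuccessCheckpointExitCode= attribute used to name the checkpoint exit code is an assumption here, by analogy with the =SuccessCheckpoint= job attributes mentioned later in this HOWTO.
+
+{file: sketch.submit}
+# Minimal sketch of the checkpoint-related submit commands; see the full,
+# working example in the 'Using +WantFTOnCheckpoint' section below.
+executable                 = 77-ckpt.py
+
+# Transfer the checkpoint file and restart the executable in place whenever
+# the executable exits to say it has written a checkpoint.
++WantFTOnCheckpoint        = TRUE
+
+# Exit code 77 means "checkpoint taken", not "finished".  (The attribute name
+# is an assumption, by analogy with the SuccessCheckpoint attributes described
+# later in this HOWTO; check the documentation for your HTCondor version.)
++SuccessCheckpointExitCode = 77
+
+# Store the checkpoint file in the schedd's spool when it is transferred.
+when_to_transfer_output    = ON_EXIT_OR_EVICT
+transfer_output_files      = 77.ckpt
+
+log    = 77-ckpt.log
+output = 77-ckpt.out
+error  = 77-ckpt.err
+
+queue 1
+{endfile}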
 
-By storing the job's checkpoint on the submit node as soon as it is created, this method allows the job to resume even if the execute node becomes unavailable.  Transferring the checkpoint before the job is interrupted is also more reliable, not just because not all interruptions permit file transfer, but because job lifetimes are typically much longer than eviction deadlines, so slower transfers are much more likely to complete.
+Before a job can use =+WantFTOnCheckpoint=, its executable must meet a number of requirements.
 
-{subsection: Assumptions}
+{subsection: Requirements}
 
-1:  Your job exits after taking a checkpoint with an exit code it does not otherwise use.
-*::  If your job does not exit when it takes a checkpoint, HTCondor can not (currently) transfer its checkpoint.  If your job exits normally when it takes a checkpoint, HTCondor will not be able to tell the difference between taking a checkpoint and actually finishing; if the checkpoint code and the terminal exit code are the same, your job will never finish.  For example, the following Python script exits with code 77 half of the time, and code 0 half of the time.
-{file: 77-time.py}
-import sys
-import time
+Your self-checkpointing code may not meet all of the following requirements.  In many cases, however, you will be able to add a wrapper script, or modify an existing one, to meet them.  (Thus, your "executable" may be a script, rather than the code that's writing the checkpoint.)  If you can not, consult the _Working Around the Assumptions_ section and/or the _Other Options_ section.  A sketch of the overall pattern these requirements imply appears after the list.
 
-if time.time() % 2 > 1:
-    sys.exit(77)
-else:
-    sys.exit(0)
-{endfile}
-2:  When restarted, your job determines on its own if a checkpoint is available, and if so, uses it.
+1:  Your executable exits after taking a checkpoint with an exit code it does not otherwise use.
+*::  If your executable does not exit when it takes a checkpoint, HTCondor will not transfer its checkpoint.  If your executable exits normally when it takes a checkpoint, HTCondor will not be able to tell the difference between taking a checkpoint and actually finishing; that is, if the checkpoint code and the terminal exit code are the same, your job will never finish.
+2:  When restarted, your executable determines on its own if a checkpoint is available, and if so, uses it.
 *::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has taken a checkpoint.
-3:  Starting your job up from a checkpoint is relatively quick.
-*::  If starting your job up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
-4:  Your job atomically updates its checkpoint file(s).
+3:  Starting your executable up from a checkpoint is relatively quick.
+*::  If starting your executable up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
+4:  Your executable atomically updates its checkpoint file(s).
 *::  Because eviction may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint when your job is evicted.
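+
+Taken together, these requirements suggest a pattern like the following sketch.  The file name =my.ckpt=, the exit code 77, and the ten-second sleep standing in for real work are illustrative; a complete example in the same spirit (=77-ckpt.py=) appears in the next section.  Writing a new file and then renaming it over the old one is one way to meet the atomic-update requirement, since the rename replaces the old checkpoint in a single step.
+
+{file: pattern-sketch.py}
+#!/usr/bin/env python
+# Sketch of the overall self-checkpointing pattern (illustrative only).
+
+import os
+import sys
+import time
+
+CHECKPOINT = 'my.ckpt'
+
+# Requirement 2: on every start-up, look for a checkpoint and use it if present.
+value = 0
+if os.path.exists(CHECKPOINT):
+    with open(CHECKPOINT) as f:
+        value = int(f.read())
+
+while value < 10:
+    time.sleep(10)      # stands in for real work
+    value += 1
+
+    # Requirement 4: update the checkpoint atomically by writing a new file
+    # and then renaming it over the old one in a single step.
+    with open(CHECKPOINT + '.tmp', 'w') as f:
+        f.write(str(value))
+    os.rename(CHECKPOINT + '.tmp', CHECKPOINT)
+
+    # Requirement 1: exit with an otherwise-unused code to tell HTCondor that
+    # a checkpoint is ready to transfer.  (Requirement 3 concerns how quickly
+    # the start-up code above runs.)
+    if value in (3, 6, 9):
+        sys.exit(77)
+
+sys.exit(0)             # really finished
+{endfile}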
 
 {subsection: Using +WantFTOnCheckpoint}
 
-The following Python script is a toy example of checkpointing.  It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step.  It writes a checkpoint file consisting only of the next number after each nap, and exits with code 77 at count 3, 6, and 9.  It exits with code 0 when complete.
+The following Python script is a toy example of checkpointing.  It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step.  It writes a checkpoint file (containing the next number) after each nap, and exits with code 77 at counts 3, 6, and 9.  It exits with code 0 when complete.
 
 {file: 77-ckpt.py}
 #!/usr/bin/env python
@@ -97,13 +74,13 @@
 queue 1
 {endfile}
 
-This script/submit file combination does not remove the "checkpoint file" generated for timestep 9 when the job completes.  This could be done in '77-ckpt.py' immediately before it exits, but that would cause the final file transfer to fail.  The script could instead remove the file and then re-create it empty, it desired.
+This example does not remove the "checkpoint file" generated for timestep 9 when the executable completes.  This could be done in =77-ckpt.py= immediately before it exits, but that would cause the final file transfer to fail.  The script could instead remove the file and then re-create it empty, if desired.
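+
+For instance, the final lines of the script could instead look like the following sketch (which assumes the checkpoint file is named =77.ckpt=, as in the rest of this HOWTO's example):
+
+{file: final-lines-sketch.py}
+# Sketch: discard the unneeded final checkpoint, but leave an empty file
+# behind so that the final file transfer still succeeds.
+import os
+import sys
+
+os.remove('77.ckpt')            # the last checkpoint is no longer needed
+open('77.ckpt', 'w').close()    # re-create it empty so file transfer succeeds
+sys.exit(0)
+{endfile}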
 
 {subsection: How Frequently to Checkpoint}
 
-Ballpark, your job should aim to checkpoint once an hour.  However, this depends on a number of factors:
+Ballpark, you should aim to checkpoint once an hour.  However, this depends on a number of factors (a small worked example follows this list):
 
-*:  How frequently your job "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will take a sixth longer, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  If your job writes checkpoints very frequently, you may only want to exit every tenth (or whatever) checkpoint.
+*:  How frequently your executable "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will take a sixth longer, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  If your job writes checkpoints very frequently, you may only want to exit every tenth (or whatever) checkpoint.
 *:  How frequently your job is interrupted.  For instance, if the max job lifetime on your pool is six hours, taking a checkpoint every three hours could easily result in losing almost (but not quite actually) three hours of work.
 *:  How long it takes to make and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?), but the second, in practice, can only be done experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Generally, when checkpoints take longer to make and/or transfer, you want to checkpoint less frequently, so as to maintain efficient job progress.  However, the longer between intervals, the more progress the job will lose when interrupted.
 *:  Your appetite for deadline risk versus your desire for fast turn-arounds.  Generally, the longer you go between checkpoints, the sooner the job will complete (because taking checkpoints and transferring them takes time).  On the other hand, if the job is interrupted, you'll wait longer because the job will have to recompute more.
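+
+As a rough way to reason about these factors: the fraction of wall-clock time spent checkpointing is the time to write and transfer a checkpoint divided by the length of a full checkpoint cycle, and an interruption costs about half a cycle of recomputation on average (and nearly a full cycle in the worst case).  The following sketch just does that arithmetic, using illustrative numbers matching the simulator example above:
+
+{file: interval-arithmetic.py}
+#!/usr/bin/env python
+# Back-of-the-envelope checkpoint-interval arithmetic (illustrative numbers).
+
+compute_minutes_between_checkpoints = 50.0    # ten five-minute time-steps
+checkpoint_write_and_transfer_minutes = 10.0
+
+cycle = compute_minutes_between_checkpoints + checkpoint_write_and_transfer_minutes
+
+# Fraction of wall-clock time spent writing and transferring checkpoints.
+overhead_fraction = checkpoint_write_and_transfer_minutes / cycle
+
+# Work lost to an interruption: about half a cycle on average, nearly a
+# full cycle in the worst case.
+average_loss_minutes = cycle / 2.0
+
+print("checkpoint overhead: {:.0%} of wall-clock time".format(overhead_fraction))
+print("average recomputation per interruption: {:.0f} minutes".format(average_loss_minutes))
+{endfile}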
@@ -133,10 +110,10 @@
 
 {subsection: Working Around the Assumptions}
 
-1:  If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script for the job may be able to inspect the sandbox after the job exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the job can return literally any value at all, HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your job can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job.
-2:  If your job requires different arguments to start from a checkpoint, you can wrap your job in a script which checks for the presence of a checkpoint and runs the jobs with the corresponding arguments.
-3:  If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
-4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 77 when it's finished doing so, your wrapper script can check for exit code 77 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  For example, you could write the following wrapper for '77-ckpt.py':
+1:  If your executable can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script may be able to inspect the sandbox after the executable exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the executable might exit with literally any code (so that no exit code is left unused), HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your executable can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken, kill the executable, and then exit with a unique code.
+2:  If your executable requires different arguments to start from a checkpoint, you can wrap your executable in a script which checks for the presence of a checkpoint and starts the executable with the corresponding arguments.
+3:  If it takes too long to start your executable up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the amount of progress you lose when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
+4:  If your executable does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoint file(s) instead.  More specifically, if your executable writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 77 when it's finished doing so, your wrapper script can check for exit code 77 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the executable, so that the executable (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  For example, you could write the following wrapper for '77-ckpt.py':
 {file: 77-atomic.sh}
 #!/usr/bin/bash
 # Remove the non-atomic checkpoint so it won't be used if we didn't create
@@ -158,17 +135,17 @@
 {endfile}
 and change =transfer_output_files= to =77.ckpt.atomic=.
 
-Future version of HTCondor may remove the requirement for job to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax this requirement; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file succesfully transfers back, or none of them do.)
+Future versions of HTCondor may remove the requirement for the job to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax the requirement for atomic checkpoint updates; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file successfully transfers back, or none of them do.)
 
-Future versions of HTCondor may provide for explicit coordination between the job and HTCondor.  Modifying a job to explicitly coordinate with HTCondor would substantially alter the assumptions.
+Future versions of HTCondor may provide for explicit coordination between the executable and HTCondor.  Explicitly coordinating with HTCondor would substantially alter these requirements.
 
 {subsection: Other Options}
 
-The preceding sections of this HOWTO explain how a job meeting this HOWTO's assumptions can take checkpoints at arbitrary intervals and transfer them back to the submit node.  Although this is the method of operation most likely to result in an interrupted job continuing from a valid checkpoint, other, less reliable options exist.
+The preceding sections of this HOWTO explain how a job meeting the requirements can take checkpoints at arbitrary intervals and transfer them back to the submit node.  Although this is the method of operation most likely to result in an interrupted job continuing from a valid checkpoint, other, less reliable options exist.
 
 {subsubsection: Delayed and Manual Transfers}
 
-If your job takes checkpoints but can not exit with a unique code when it does, you have two options.  The first is much simpler, but only preserves progress when your job is evicted (e.g., not when the machine shuts down or the network fails).  To ensure that the checkpoint(s) a job has already taken are transferred when evicted, set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and include the checkpoint file(s) in =transfer_output_files=.  All the other assumptions still apply, except that quick restarts may be less important if eviction is infrequent in your pool.
+If your executable takes checkpoints but can not exit with a unique code when it does, you have two options.  The first is much simpler, but only preserves progress when your job is evicted (not, for example, when the machine shuts down or the network fails).  To ensure that the checkpoint(s) a job has already taken are transferred when evicted, set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and include the checkpoint file(s) in =transfer_output_files=.  All the other requirements still apply, except that quick restarts may be less important if eviction is infrequent in your pool.
 
 If your job can determine when it has successfully taken a checkpoint, but it can not stop when it does, or doing so is too expensive, it could instead transfer its checkpoints without HTCondor's help.  In most cases, this will involve using =condor_chirp= (by setting =+WantIOProxy= to =TRUE= and calling the =condor_chirp= command-line tool).  Your job would be responsible for fetching its own checkpoint file(s) on start-up.  (You could also create an empty checkpoint file and list it as part of =transfer_input_files=.)
 
@@ -176,17 +153,17 @@
 
 If you're not already familiar with the programmer's use of the word 'signals', skip this section and the next.
 
-If your job can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE=, and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual.
+If your executable can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE=, and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at an interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual for =+WantFTOnCheckpoint=.
 
 {subsubsection: Reactive Checkpoints}
 
-If you're not already familiar with the programmer's use of the 'signals',
-skip this section.
+If you're not already familiar with the programmer's use of the word 'signals', skip this section.
 
-Instead of taking a checkpoint at some interval, it is possible, for some types of interruption, to instead take a checkpoint when interrupted.  Specifically, if your execution resources are generally reliable, and your job's checkpoints both quick to take and small, your job may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is evicted.  This works like the previous section, except that you set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal (that causes your job to checkpoint), and the signal is only sent when your job is preempted.  The administrator of the execute machine determines the maximum amount of time is allowed to run after receiving its =KillSig=; a job may request a longer delay than the machine's default by setting =JobMaxVacateTime= (but this will be capped by the administrator's setting).
+Instead of taking a checkpoint at some interval, it is possible, for some types of interruption, to take a checkpoint at the time of the interruption.  Specifically, if your execution resources are generally reliable, and your job's checkpoints are both quick to take and small, your executable may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is evicted.  This works like the previous section, except that you set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal that causes your job to checkpoint, and the signal is only sent when your job is preempted.  The administrator of the execute machine determines the maximum amount of time a job is allowed to run after receiving its =KillSig=; a job may request a longer delay than the machine's default by setting =JobMaxVacateTime= (but this will be capped by the administrator's setting).
 
 You should probably only use this method of operation if your job runs on an HTCondor pool too old to support =+WantFTOnCheckpoint=, or the pool administrator has disallowed use of the feature (because it can be resource-intensive).
 
 {subsubsection: Early Checkpoint Exits}
 
-If your job's natural checkpoint interval is half or more of your pool's max job runtime, it may make sense to checkpoint and then immediately ask to be rescheduled, rather than lower your user priority doing work you know will be thrown away.  In this case, you can use the =OnExitRemove= job attribute to determine if your job should be rescheduled after exiting.  Don't set =ON_EXIT_OR_EVICT=, and don't set =+WantFTOnCheckpoint=; just have the job exit with a unique code after its checkpoint.
+If your executable's natural checkpoint interval is half or more of your pool's max job runtime, it may make sense to checkpoint and then immediately ask to be rescheduled, rather than lower your user priority by doing work you know will be thrown away.  In this case, you can use the =OnExitRemove= job attribute to determine if your job should be rescheduled after exiting.  Don't set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and don't set =+WantFTOnCheckpoint=; just have the job exit with a unique code after its checkpoint.