-This HOWTO is about writing jobs for an executable which periodically saves checkpoint information, and how to make HTCondor store that information safely, in case it's needed to restart the job.
+This HOWTO is about writing jobs for an executable which periodically saves checkpoint information, and how to make HTCondor store that information safely, in case it's needed to continue the job on another machine (or at another time).
 
 This HOWTO is _not_ about how to checkpoint a given executable; that's up to you or your software provider.
 
 {section: How To Run Self-Checkpointing Jobs}
 
-The best way to run self-checkpointing code is with the submit command =+WantFTOnCheckpoint=.  This causes HTCondor to transfer checkpoint files to the submit node when the executable indicates that it has finished writing them.  After file transfer is complete, HTCondor will restart the executable on the same machine and in the same sandbox.  This immediate transfer means that the checkpoint is available for restarting the job even if the job is interrupted in a way that doesn't allow for files to be transferred, or if file transfer doesn't complete in the time allowed.
+The best way to run self-checkpointing code is to set =CheckpointExitCode= in your submit file.  If the =executable= exits with =CheckpointExitCode=, HTCondor will transfer the checkpoint to the submit node, and then immediately restart the =executable= in the same sandbox on the same machine, with the same =arguments=.  This immediate transfer makes the checkpoint available for continuing the job even if the job is interrupted in a way that doesn't allow for files to be transferred (e.g., power failure), or if the file transfer doesn't complete in the time allowed.
 
-Before a job can use =+WantFTOnCheckpoint=, its executable must meet a number of requirements.
+Before a job can use =CheckpointExitCode=, its =executable= must meet a number of requirements.
 
 {subsection: Requirements}
 
-Your self-checkpointing code may not meet all of the following requirements.  In many cases, however, you will be able to add a wrapper script, or modify an existing one, to meet these requirements.  (Thus, your "executable" may be a script, rather than the code that's writing the checkpoint.)  If you can not, consult the _Working Around the Assumptions_ section and/or the _Other Options_ section.
+Your self-checkpointing code may not meet all of the following requirements.  In many cases, however, you will be able to add a wrapper script, or modify an existing one, to meet these requirements.  (Thus, your =executable= may be a script, rather than the code that's writing the checkpoint.)  If you cannot, consult the _Working Around the Assumptions_ section and/or the _Other Options_ section.
 
 1:  Your executable exits after taking a checkpoint with an exit code it does not otherwise use.
 *::  If your executable does not exit when it takes a checkpoint, HTCondor will not transfer its checkpoint.  If your executable exits normally when it takes a checkpoint, HTCondor will not be able to tell the difference between taking a checkpoint and actually finishing; that is, if the checkpoint exit code and the terminal exit code are the same, your job will never finish.
@@ -18,12 +18,10 @@
 *::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has taken a checkpoint.
 3:  Starting your executable up from a checkpoint is relatively quick.
 *::  If starting your executable up from a checkpoint is relatively slow, your job may not run efficiently enough to be useful, depending on the frequency of checkpoints and interruptions.
-4:  Your executable atomically updates its checkpoint file(s).
-*::  Because eviction may occur at any time, if your job does not update its checkpoints atomically, HTCondor may transfer a partially-updated checkpoint when your job is evicted.
 
-{subsection: Using +WantFTOnCheckpoint}
+{subsection: Using CheckpointExitCode}
 
-The following Python script is a toy example of checkpointing.  It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step.  It writes a checkpoint file (containing the next number) after each nap, and exits with code 77 at count 3, 6, and 9.  It exits with code 0 when complete.
+The following Python script is a toy example of code that checkpoints itself.  It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step.  It writes a checkpoint file (containing the next number) after each nap, and exits with code 77 at count 3, 6, and 9.  It exits with code 0 when complete.
 
 {file: 77-ckpt.py}
 #!/usr/bin/env python
@@ -52,20 +50,15 @@
 sys.exit(0)
 {endfile}
 
-The following submit file commands HTCondor to transfer the file '77.ckpt' to the submit node whenever the script exits with code 77.  If interrupted, the job will resume from the most recent of those checkpoints.
+The following submit file commands HTCondor to transfer the file '77.ckpt' to the submit node whenever the script exits with code 77.  If interrupted, the job will resume from the most recent of those checkpoints.  Note that you _must_ include your checkpoint file(s) in =transfer_output_files=; otherwise HTCondor will not transfer it (them).
 
 {file: 77.submit}
-# You must set both of these things to do file transfer when you exit...
-+WantFTOnCheckpoint         = TRUE
-when_to_transfer_output     = ON_EXIT_OR_EVICT
-# ... with code 77.
-+SuccessCheckpointExitCode  = 77
-# You must include your checkpoint file in transfer_output_files.
+CheckpointExitCode          = 77
 transfer_output_files       = 77.ckpt
+should_transfer_files       = yes
 
 executable                  = 77-ckpt.py
 arguments                   =
-should_transfer_files       = yes
 
 output                      = 77-ckpt.out
 error                       = 77-ckpt.err
@@ -78,16 +71,18 @@
 
 {subsection: How Frequently to Checkpoint}
 
-Ballpark, you should aim to checkpoint once an hour.  However, this depends on a number of factors:
+Obviously, the longer the code spends writing checkpoints, and the longer your job spends transferring them, the longer it will take for you to get the job's results.  Conversely, the more frequently the job transfers new checkpoints, the less time the job loses if it's interrupted.  For most users and for most jobs, taking a checkpoint about once an hour works well, and it's not a bad place to start experimenting.  A number of factors will skew this interval up or down:
 
-*:  How frequently your executable "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that gives the simulator and HTCondor ten minutes to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will take a sixth longer, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes.  If that takes about forty-five minutes for some and ninety for others, that's fine.  If your job writes checkpoints very frequently, you may only want to exit every tenth (or whatever) checkpoint.
-*:  How frequently your job is interrupted.  For instance, if the max job lifetime on your pool is six hours, taking a checkpoint every three hours could easily result in losing almost (but not quite actually) three hours of work.
-*:  How long it takes to make and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?), but the second, in practice, can only be done experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including for file transfers on checkpoint, so this duration should be easy to determine.)  Generally, when checkpoints take longer to make and/or transfer, you want to checkpoint less frequently, so as to maintain efficient job progress.  However, the longer between intervals, the more progress the job will lose when interrupted.
-*:  Your appetite for deadline risk versus your desire for fast turn-arounds.  Generally, the longer you go between checkpoints, the sooner the job will complete (because taking checkpoints and transferring them takes time).  On the other hand, if the job is interrupted, you'll wait longer because the job will have to recompute more.
+*:  If your job(s) usually run on resources with strict time limits, you may want to adjust how often your job checkpoints to minimize wasted time.  For instance, suppose your jobs run on machines with a six-hour runtime limit, your job writes a checkpoint after each hour of computation, and each checkpoint takes five minutes to write out and then transfer.  Your fifth checkpoint will finish twenty-five minutes into the sixth hour, and you won't gain any benefit from the next thirty-five minutes of computation.  If you instead write a checkpoint every eighty-four minutes, your job will only waste four minutes.  (The arithmetic is sketched just after this list.)
+*:  If a particular code writes larger checkpoints, or writes smaller checkpoints unusually slowly, you may want to take a checkpoint less frequently than you would for other jobs of a similar length, to keep the total overhead (delay) the same.  The opposite is also true: if the job can take checkpoints particularly quickly, or the checkpoints are particularly small, the job could checkpoint more often for the same amount of overhead.
+*:  Some code naturally checkpoints at longer or shorter intervals.  If a code writes a checkpoint every five minutes, it may make sense for the =executable= to wait for the code to write ten or more checkpoints before exiting (which asks HTCondor to transfer the checkpoint file(s)).  If a job is a sequence of steps, the natural (or only) checkpoint interval may be between steps.
+*:  Consider how long it takes to restart from a checkpoint.  It should never take longer to restart from a checkpoint than to recompute from the beginning, but the restart process is part of the overhead of taking a checkpoint.  The longer a code takes to restart, the less often the =executable= should exit.
+
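+For concreteness, here is the arithmetic behind the strict-time-limit example above, as a short Python sketch.  The six-hour limit and five-minute checkpoint overhead are just that example's numbers, not recommendations.
+
+{file: interval-arithmetic.py}
+#!/usr/bin/env python
+# Wasted computation under a machine runtime limit, for the example above.
+# All values are in minutes.
+limit = 6 * 60       # the machine's runtime limit
+overhead = 5         # time to write out and then transfer one checkpoint
+
+for compute in (60, 84):           # computation time between checkpoints
+    cycle = compute + overhead
+    completed = limit // cycle     # checkpoints that finish before the limit
+    wasted = limit - completed * cycle
+    print("%d-minute interval: %d checkpoints complete, %d minutes wasted"
+          % (compute, completed, wasted))
+{endfile}
+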
+Measuring how long it takes your code to write a checkpoint is left as an exercise for the reader.  Since version 8.9.1, however, HTCondor records in the job's user log (the =log= submit command), if one is enabled, how long each file transfer took, including checkpoint transfers.
 
 {subsection: Debugging Self-Checkpointing Jobs}
 
-Because a job may be interrupted at any time, it's valid to interrupt the job at any time and see if a valid checkpoint is transferred.  To do so, use =condor_vacate_job= to evict the job.  When it's done transferring (watch the user log), use =condor_hold= to put it on hold, so that it can't restart while you're looking at the checkpoint (and potentially, overwrite it).  Finally, to obtain the checkpoint file(s) themselves, use =condor_config_val= to ask where the SPOOL is, and then look in appropriate subdirectory.
+Because a job may be interrupted at any time, it's legitimate to interrupt the job deliberately and check that a valid checkpoint is transferred.  To do so, use =condor_vacate_job= to evict the job.  When that's done (watch the user log), use =condor_hold= to put it on hold, so that it can't restart while you're looking at the checkpoint (and potentially overwrite it).  Finally, to obtain the checkpoint file(s) themselves, use =condor_config_val= to ask where =SPOOL= is, and then look in the appropriate subdirectory (named after the job ID).
 
 For example, if your job is ID 635.0, and is logging to the file 'job.log', you can copy the files in the checkpoint to a subdirectory of the current directory as follows:
 {code}
@@ -110,60 +105,54 @@
 
 {subsection: Working Around the Assumptions}
 
-1:  If your executable can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script may be able to inspect the sandbox after the executable exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the executable can return literally any value at all, HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your executable can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken, kill the executable, and then exit with a unique code.
-2:  If your executable requires different arguments to start from a checkpoint, you can wrap your executable in a script which checks for the presence of a checkpoint and starts the executable with the corresponding arguments.
-3:  If it takes too long to start your executable up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the amount of progress you lose when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
-4:  If your executable does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoint file(s) instead.  More specifically, if your executable writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 77 when it's finished doing so, your wrapper script can check for exit code 77 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  For example, you could write the following wrapper for '77-ckpt.py':
-{file: 77-atomic.sh}
-#!/usr/bin/bash
-# Remove the non-atomic checkpoint so it won't be used if we didn't create
-# an atomic checkpoint.
-rm -f 77.ckpt
-# If we have an atomic checkpoint, rename it to the checkpoint file that
-# 77-atomic.py is expecting.
-if [ -f 77.ckpt.atomic ]; then
-    mv 77.ckpt.atomic 77.ckpt
-fi
-# Run 77-atomic and record its exit code.
-./77-atomic.py
-RV=$?
-# If we took a checkpoint, atomatically rename it.
-if [ ${RV} -eq 77]; then
-    mv 77.ckpt 77.ckpt.atomic
-fi
-exit ${RV}
-{endfile}
-and change =transfer_output_files= to =77.ckpt.atomic=.
-
-Future version of HTCondor may remove the requirement for job to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax the requirement for atomic checkpoint udpates; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file successfully transfers back, or none of them do.)
+The basic technique here is to write a wrapper script (or modify an existing one) so that the =executable= has the necessary behavior, even if the code does not.  A sketch of such a wrapper follows the list below.
 
-Future versions of HTCondor may provide for explicit coordination between the executable and HTCondor.  Explicitly coordinate with HTCondor would substantially alter these requirements.
+1: _Your executable exits after taking a checkpoint with an exit code it does not otherwise use._
+*:: If your code exits when it takes a checkpoint, but not with a unique code, your wrapper script will have to determine, when the executable exits, if it did so because it took a checkpoint.  If so, the wrapper script will have to exit with a unique code.  If the code could usefully exit with any code, and the wrapper script therefore cannot exit with a unique code, you can instead instruct HTCondor to consider being killed by a particular signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  (If you do not set =CheckpointExitCode=, you must set =+WantFTOnCheckpoint=.)
+*:: If your code does not exit when it takes a checkpoint, the wrapper script will have to determine when a checkpoint has been made, kill the program, and then exit with a unique code.
+2: _When restarted, your executable determines on its own if a checkpoint is available, and if so, uses it._
+*:: If your code requires different arguments to start from a checkpoint, the wrapper script must check for the presence of a checkpoint and start the executable with correspondingly modified arguments.
+3:  _Starting your executable up from a checkpoint is relatively quick._
+*:: The longer the start-up delay, the slower the job's overall progress.  If your job's progress is too slow as a result of start-up delay, and your code can take checkpoints without exiting, read the 'Delayed Transfers' and 'Manual Transfers' sections below.
 
 {subsection: Other Options}
 
 The preceding sections of this HOWTO explain how a job meeting the requirements can take checkpoints at arbitrary intervals and transfer them back to the submit node.  Although this is the method of operation most likely to result in an interrupted job continuing from a valid checkpoint, other, less reliable options exist.
 
-{subsubsection: Delayed and Manual Transfers}
+{subsubsection: Delayed Transfers}
 
-If your executable takes checkpoints but can not exit with a unique code when it does, you have two options.  The first is much simpler, but only preserves progress when your job is evicted (e.g., not when the machine shuts down or the network fails).  To ensure that the checkpoint(s) a job has already taken are transferred when evicted, set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and include the checkpoint file(s) in =transfer_output_files=.  All the other assumptions still apply, except that quick restarts may be less important if eviction is infrequent in your pool.
+This method is risky, because it does not allow your job to recover from any failure mode other than an eviction (and sometimes not even then).  It may also require changes to your =executable=.  The advantages of this method are that it doesn't require your code to exit and restart after each checkpoint, and that it works even with older versions of HTCondor.
 
-If your job can determine when it has successfully taken a checkpoint, but it can not stop when it does, or doing so is too expensive, it could instead transfer its checkpoints without HTCondor's help.  In most cases, this will involve using =condor_chirp= (by setting =+WantIOProxy= to =TRUE= and calling the =condor_chirp= command-line tool).  Your job would be responsible for fetching its own checkpoint file(s) on start-up.  (You could also create an empty checkpoint file and list it as part of =transfer_input_files=.)
+The basic idea is to take checkpoints as the job runs, but not transfer them back to the submit node until the job is evicted.  This implies that your =executable= doesn't exit until the job is complete (which is the normal case).  If your code has long start-up delays, you'll naturally not want it to exit after it writes a checkpoint; otherwise, the wrapper script could restart the code as necessary.
 
-{subsubsection: Signals}
+To use this method, set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= instead of setting =CheckpointExitCode=.  This will cause HTCondor to transfer your checkpoint file(s) (which you listed in =transfer_output_files=, as noted above) when the job is evicted.  Of course, since this is the only time your checkpoint file(s) will be transferred, if the transfer fails, your job has to start over from the beginning.  One reason file transfer on eviction fails is that it takes too long, so this method may not work if the file(s) listed in =transfer_output_files= contain too much data.
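+
+A submit file using this method might look like the following sketch.  The program name ('my_code') and checkpoint file name ('my.ckpt') are hypothetical; the sketch assumes a code which writes its checkpoint to 'my.ckpt' and does not exit until it has finished.
+
+{file: delayed.submit}
+# Sketch only; see the assumptions above.  Add your code's other output
+# files to transfer_output_files as well.
+executable              = my_code
+arguments               =
+
+should_transfer_files   = yes
+when_to_transfer_output = ON_EXIT_OR_EVICT
+transfer_output_files   = my.ckpt
+
+output                  = my_code.out
+error                   = my_code.err
+log                     = my_code.log
+
+queue 1
+{endfile}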
 
-If you're not already familiar with the programmer's use of the word 'signals', skip this section and the next.
+Furthermore, eviction can happen at any time, including while the code is updating its checkpoint file(s).  If the code does not update its checkpoint file(s) atomically, HTCondor will transfer the partially-updated checkpoint file(s), potentially overwriting the previous, complete one(s); this will probably prevent the code from picking up where it left off.
 
-If your executable can not spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE=, and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual for =+WantFTOnCheckpoint=.
+In some cases, you can work around this problem by using a wrapper script.  The idea is that renaming a file is an atomic operation, so if your code writes checkpoints to one file, call it =checkpoint=, your wrapper script, when it detects that the checkpoint is complete, would rename that file to =checkpoint.atomic=.  That way, =checkpoint.atomic= always has a complete checkpoint in it.  With such a script, instead of putting =checkpoint= in =transfer_output_files=, you would put =checkpoint.atomic=, and HTCondor would never see a partially-complete checkpoint file.  (The script would also, of course, have to copy =checkpoint.atomic= to =checkpoint= before running the code.)
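+
+The following sketch shows the shape of such a wrapper.  The details are hypothetical: it assumes the code is =my_code=, that it writes its checkpoint to 'checkpoint', that it touches a sentinel file 'checkpoint.done' once a checkpoint has been completely written, and that it does not exit until the whole run is finished.  A real wrapper would have to use whatever indication of completeness your real code provides, and coordinate with the code more carefully than this sketch does.
+
+{file: atomic-wrapper.py}
+#!/usr/bin/env python
+# Sketch only; see the assumptions described above.
+
+import os
+import shutil
+import subprocess
+import sys
+import time
+
+# Give the code the most recent complete checkpoint, if there is one.
+if os.path.exists('checkpoint.atomic'):
+    shutil.copy('checkpoint.atomic', 'checkpoint')
+
+proc = subprocess.Popen(['./my_code'])
+
+while proc.poll() is None:
+    if os.path.exists('checkpoint.done'):
+        os.remove('checkpoint.done')
+        # rename() is atomic, so 'checkpoint.atomic' always contains a
+        # complete checkpoint, even if the job is evicted mid-update.
+        os.rename('checkpoint', 'checkpoint.atomic')
+    time.sleep(10)
+
+sys.exit(proc.returncode)
+{endfile}
+
+Renaming, rather than copying into place, is what keeps the update atomic; this assumes the code opens a new 'checkpoint' file for each checkpoint it writes.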
 
-{subsubsection: Reactive Checkpoints}
+{subsubsection: Manual Transfers}
 
-If you're not already familiar with the programmer's use of the 'signals',
-skip this section.
+If you're comfortable with programming, instead of running a job with =CheckpointExitCode=, you could use =condor_chirp=, or other tools, to manage your checkpoint file(s).  Your =executable= would be responsible for downloading the checkpoint file(s) on start-up, and periodically uploading the checkpoint file(s) during execution.  We don't recommend you do this for the same reasons we recommend against managing your own input and output file transfers.
 
-Instead of taking a checkpoint at some interval, it is possible, for some types of interruption, to instead take a checkpoint when interrupted.  Specifically, if your execution resources are generally reliable, and your job's checkpoints both quick to take and small, your executable may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is evicted.  This works like the previous section, except that you set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal that causes your job to checkpoint, and the signal is only sent when your job is preempted.  The administrator of the execute machine determines the maximum amount of time is allowed to run after receiving its =KillSig=; a job may request a longer delay than the machine's default by setting =JobMaxVacateTime= (but this will be capped by the administrator's setting).
+{subsubsection: Early Checkpoint Exits}
 
-You should probably only use this method of operation if your job runs on an HTCondor pool too old to support =+WantFTOnCheckpoint=, or the pool administrator has disallowed use of the feature (because it can be resource-intensive).
 
-{subsubsection: Early Checkpoint Exits}
 
 If your executable's natural checkpoint interval is half or more of your pool's max job runtime, it may make sense to checkpoint and then immediately ask to be rescheduled, rather than lower your user priority by doing work you know will be thrown away.  In this case, you can use the =OnExitRemove= job attribute to determine if your job should be rescheduled after exiting.  Don't set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and don't set =CheckpointExitCode= or =+WantFTOnCheckpoint=; just have the job exit with a unique code after its checkpoint.
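+
+Continuing with the toy example above, which exits with code 77 after writing its checkpoint, the relevant submit-file line might look like the following sketch.  (=on_exit_remove= is the submit-command form of the =OnExitRemove= attribute.)  Getting the transferred checkpoint back to the job's next execution, for instance by also listing it in =transfer_input_files=, is not covered here.
+
+{file: early-exit.submit-fragment}
+# Sketch only: if the job exits with the checkpoint code (77), or is killed
+# by a signal, leave it in the queue so it will be rescheduled; if it exits
+# with any other code, let it leave the queue as usual.
+on_exit_remove = (ExitBySignal == false) && (ExitCode != 77)
+{endfile}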
+
+{subsection: Signals}
+
+Signals offer additional options for running self-checkpointing jobs.  If you're not familiar with signals, this section may not make sense to you.
+
+{subsubsection: Periodic Signals}
+
+HTCondor supports transferring checkpoint file(s) for =executable=s which take a checkpoint when sent a particular signal, if the =executable= then exits in a unique way.  Set =+WantCheckpointSignal= to =TRUE= to periodically receive checkpoint signals, and =+CheckpointSig= to specify which one.  (The interval is specified by the administrator of the execute machine.)  The unique way may be a specific exit code, for which you would set =CheckpointExitCode=, or a signal, for which you would set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  (If you do not set =CheckpointExitCode=, you must set =+WantFTOnCheckpoint=.)
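+
+For example, the relevant submit-file entries might look like the following sketch.  The choice of SIGUSR2 is arbitrary, and the exact form of the =+CheckpointSig= value (a quoted signal name here) is an assumption; check the documentation for your version of HTCondor.
+
+{file: periodic-signal.submit-fragment}
+# Sketch only: ask to be sent a checkpoint signal periodically (the interval
+# is set by the execute machine's administrator), and treat a subsequent
+# exit with code 77 as a successful checkpoint.
++WantCheckpointSignal       = TRUE
++CheckpointSig              = "SIGUSR2"
+CheckpointExitCode          = 77
+{endfile}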
+
+{subsubsection: Delayed Transfer with Signals}
+
+This method is very similar to, but riskier than, delayed transfers, because in addition to delaying the transfer of the checkpoint file(s), it also delays their creation.  Thus, this option should almost never be used; if taking and transferring your checkpoint file(s) is fast enough to reliably complete during an eviction, you're not losing much by doing so periodically, and it's unlikely that a code which takes small checkpoints quickly takes a long time to start up.  However, this method will work even with very old versions of HTCondor.
+
+To use this method, set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal that causes your job to checkpoint.
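+
+To make this concrete, a sketch of the relevant submit-file entries follows.  SIGUSR2 is an arbitrary placeholder and assumes your code writes a checkpoint when it receives that signal; =kill_sig= is the submit-command form of =KillSig=.
+
+{file: evict-signal.submit-fragment}
+# Sketch only: transfer the checkpoint file(s) listed in
+# transfer_output_files when the job is evicted, and replace the usual
+# SIGTERM with the signal that tells the code to write a checkpoint.
+when_to_transfer_output = ON_EXIT_OR_EVICT
+kill_sig                = SIGUSR2
+{endfile}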