The following Python script is a toy example of checkpointing.  It counts from 0 to 10 (exclusive), sleeping for 10 seconds at each step.  After each nap, it writes a checkpoint file containing only the next count; it exits with code 77 at counts 3, 6, and 9, and with code 0 when complete.
 
-{file: vuc.py}
+{file: 77-ckpt.py}
 #!/usr/bin/env python
 
 import sys
@@ -56,7 +56,7 @@
 
 value = 0
 try:
-    with open('vuc.ckpt', 'r') as f:
+    with open('77.ckpt', 'r') as f:
         value = int(f.read())
 except IOError:
     pass
@@ -66,7 +66,7 @@
     print("Computing timestamp {0}".format(value))
     time.sleep(10)
     value += 1
-    with open('vuc.ckpt', 'w') as f:
+    with open('77.ckpt', 'w') as f:
         f.write("{0}".format(value))
     if value%3 == 0:
         sys.exit(77)
@@ -75,29 +75,29 @@
 sys.exit(0)
 {endfile}
 
-The following submit file commands HTCondor to transfer the file 'vuc.ckpt' to the submit node whenever the script exits with code 77.  If interrupted, the job will resume from the most recent of those checkpoints.
+The following submit file commands HTCondor to transfer the file '77.ckpt' to the submit node whenever the script exits with code 77.  If interrupted, the job will resume from the most recent of those checkpoints.
 
-{file: vuc.submit}
+{file: 77.submit}
 # You must set both of these things to do file transfer when you exit...
 +WantFTOnCheckpoint         = TRUE
 when_to_transfer_output     = ON_EXIT_OR_EVICT
 # ... with code 77.
 +SuccessCheckpointExitCode  = 77
 # You must include your checkpoint file in transfer_output_files.
-transfer_output_files       = vuc.ckpt
+transfer_output_files       = 77.ckpt
 
-executable                  = vuc.py
+executable                  = 77-ckpt.py
 arguments                   =
 should_transfer_files       = yes
 
-output                      = vuc.out
-error                       = vuc.err
-log                         = vuc.log
+output                      = 77-ckpt.out
+error                       = 77-ckpt.err
+log                         = 77-ckpt.log
 
 queue 1
 {endfile}
 
-This script/submit file combination does not remove the "checkpoint file" generated for timestep 9 when the job completes.  This could be done in 'vuc.py' immediately before it exits, but that would cause the final file transfer to fail.  The script could instead remove the file and then re-create it empty, it desired.
+This script/submit file combination does not remove the "checkpoint file" generated for timestep 9 when the job completes.  This could be done in '77-ckpt.py' immediately before it exits, but that would cause the final file transfer to fail.  The script could instead remove the file and then re-create it empty, if desired.
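+
+For example (this is only a sketch, and '77-ckpt-cleanup.py' is not one of the files used above), the end of '77-ckpt.py' could be changed to the following, assuming 'import os' is added at the top of the script:
+{file: 77-ckpt-cleanup.py}
+# End of 77-ckpt.py, modified to remove the last checkpoint and then
+# re-create it empty: the final transfer of '77.ckpt' still succeeds, but
+# no stale checkpoint is left behind.
+os.remove('77.ckpt')
+with open('77.ckpt', 'w') as f:
+    pass
+
+sys.exit(0)
+{endfile}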
 
 {subsection: How Frequently to Checkpoint}
 
@@ -136,10 +136,27 @@
 1:  If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script for the job may be able to inspect the sandbox after the job exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value; a sketch of this appears after this list.  If the job can return literally any value at all, HTCondor can also regard being killed by a particular (unique) signal as a sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal.  If your job cannot be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job.
 2:  If your job requires different arguments to start from a checkpoint, you can wrap your job in a script which checks for the presence of a checkpoint and runs the job with the corresponding arguments; a sketch of this also appears after this list.
 3:  If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
-4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  The following Python script implements the logic above:
-{file: 77-atomic.py}
-
+4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 77 when it's finished doing so, your wrapper script can check for exit code 77 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  For example, you could write the following wrapper for '77-ckpt.py':
+{file: 77-atomic.sh}
+#!/bin/bash
+# Remove the non-atomic checkpoint so it won't be used if we didn't create
+# an atomic checkpoint.
+rm -f 77.ckpt
+# If we have an atomic checkpoint, copy it to the checkpoint file that
+# 77-ckpt.py is expecting.  (Copy, rather than rename, so that the known-good
+# checkpoint remains in the sandbox even if the job is interrupted while
+# rewriting 77.ckpt.)
+if [ -f 77.ckpt.atomic ]; then
+    cp 77.ckpt.atomic 77.ckpt
+fi
+# Run 77-ckpt.py and record its exit code.
+./77-ckpt.py
+RV=$?
+# If we took a checkpoint, atomically rename it.
+if [ ${RV} -eq 77 ]; then
+    mv 77.ckpt 77.ckpt.atomic
+fi
+exit ${RV}
 {endfile}
+and change =transfer_output_files= to =77.ckpt.atomic=.  (The submit file would also need =executable= set to the wrapper script and '77-ckpt.py' added to =transfer_input_files=.)
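+
+As a sketch of workaround 1 above, a wrapper can report a distinctive exit code whenever the job leaves a new checkpoint behind.  All of the names in this sketch ('detect-ckpt.py', './my-job', 'my-job.ckpt') are hypothetical and not files from the example above; exit code 77 matches =+SuccessCheckpointExitCode= from the submit file.
+{file: detect-ckpt.py}
+#!/usr/bin/env python
+# Hypothetical wrapper for a job ('./my-job') that writes 'my-job.ckpt' when
+# it takes a checkpoint but does not use a unique exit code.  If the job
+# exits normally and the checkpoint file has changed, exit with code 77 so
+# HTCondor treats the run as a successful checkpoint.
+
+import os
+import subprocess
+import sys
+
+def ckpt_mtime():
+    try:
+        return os.path.getmtime('my-job.ckpt')
+    except OSError:
+        return None
+
+before = ckpt_mtime()
+rv = subprocess.call(['./my-job'])
+after = ckpt_mtime()
+
+if rv == 0 and after is not None and after != before:
+    sys.exit(77)
+sys.exit(rv)
+{endfile}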
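+
+As a sketch of workaround 2, again with hypothetical names, a wrapper can choose the job's arguments based on whether a checkpoint is already present in the sandbox:
+{file: resume-args.py}
+#!/usr/bin/env python
+# Hypothetical wrapper for a job ('./my-job') that needs to be told to
+# resume from its checkpoint file ('my-job.ckpt') when one exists.
+
+import os
+import subprocess
+import sys
+
+if os.path.exists('my-job.ckpt'):
+    args = ['./my-job', '--resume', 'my-job.ckpt']
+else:
+    args = ['./my-job']
+
+sys.exit(subprocess.call(args))
+{endfile}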
 
 Future versions of HTCondor may remove the requirement for jobs to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax this requirement; the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file successfully transfers back, or none of them do.)