This HOWTO describes how to use the features of HTCondor to run self-checkpointing jobs.  This HOWTO is not about how to checkpoint a given job; that's up to you or your software provider.
 
-This HOWTO focuses on using =+WantFTOnCheckpoint=, a submit command which causes HTCondor to do file transfer when the job checkpoints.  This feature makes a number of assumptions about the job, detailed in the 'Assumptions' section.  These assumptions may not be true for your job, but in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to make them true.  If you can not, consult the _Working Around the Assumptions_ and/or _Other Options_ sections.
+This HOWTO focuses on using =+WantFTOnCheckpoint=, a submit command which causes HTCondor to do file transfer when a job indicates that it has taken a checkpoint by exiting in a specific way.  While there are other approaches, we recommend this one because it allows jobs to checkpoint and resume across as many different types of interruptions as possible, and because among the techniques that make the same claim, it's the simplest.
+
+The =+WantFTOnCheckpoint= submit command makes a number of assumptions about the job, detailed in the 'Assumptions' section.  These assumptions may not be true for your job, but in many cases, you will be able to modify your job (by adding a wrapper script or altering an existing one) to make them true.  If you can not, consult the _Working Around the Assumptions_ section and/or the _Other Options_ section.
 
 {subsection: General Idea}
 
@@ -10,8 +12,13 @@
 
 File transfer on checkpoint works in the following way:
 
-1:  The job exits after taking a checkpoint with a unique exit code.
-2:  HTCondor recognizes the unique exit code and does file transfer; this file transfer presently behaves as it would if the job were being evicted.  This implies that =when_to_transfer_output= is set to =ON_EXIT_OR_EVICT=.
+1:  The job exits after taking a checkpoint with a unique exit code.  For example, the following Python script exits with code 77.
+{file: 77.py}
+import sys
+print("Now exiting with code 77.")
+sys.exit(77)
+{endfile}
+2:  HTCondor recognizes the unique exit code and does file transfer; this file transfer presently behaves as it would if the job were being evicted.  Thus, your job must set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=; see the submit-file fragment after this list.
 3:  After the job's =transfer_output_files= are successfully sent to the submit node (and are stored in the schedd's spool, as normal for file transfer on eviction), HTCondor restarts the job exactly as it started it the first time.
 4:  If something interrupts the job, HTCondor reschedules it as normal.  As for any job with =ON_EXIT_OR_EVICT= set, HTCondor will restart the job with the files stored in the schedd's spool.
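+
+For example, the following submit-file fragment causes HTCondor to recognize exit code 77 (as used by 77.py, above) as indicating a successful checkpoint; the complete submit file in the _Using +WantFTOnCheckpoint_ section below includes the same three commands.
+{file: fragment.submit}
+# Treat exit code 77 as a successful checkpoint and transfer files then.
+when_to_transfer_output     = ON_EXIT_OR_EVICT
++WantFTOnCheckpoint         = TRUE
++SuccessCheckpointExitCode  = 77
+{endfile}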
 
@@ -20,7 +27,16 @@
 {subsection: Assumptions}
 
 1:  Your job exits after taking a checkpoint with an exit code it does not otherwise use.
-*::  If your job does not exit when it takes a checkpoint, HTCondor can not (currently) transfer its checkpoint.  If your job does not exit with a unique code when it takes a checkpoint, HTCondor will transfer files and restart the job whenever the job exits with that code; if the checkpoint code and the terminal exit code are the same, your job will never finish.
+*::  If your job does not exit when it takes a checkpoint, HTCondor can not (currently) transfer its checkpoint.  If your job exits normally when it takes a checkpoint, HTCondor will not be able to tell the difference between taking a checkpoint and actually finishing; if the checkpoint code and the terminal exit code are the same, your job will never finish.  For example, the following Python script exits with code 77 half of the time and code 0 half of the time; HTCondor could not reliably interpret its exit code 77 as a successful checkpoint.
+{file: 77-time.py}
+import sys
+import time
+
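+# time.time() % 2 falls in [0, 2), so the following test succeeds
+# about half of the time.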
+if time.time() % 2 > 1:
+    sys.exit(77)
+else:
+    sys.exit(0)
+{endfile}
 2:  When restarted, your job determines on its own if a checkpoint is available, and if so, uses it.
 *::  If your job does not look for a checkpoint each time it starts up, it will start from scratch each time; HTCondor does not run a different command line when restarting a job which has taken a checkpoint.
 3:  Starting your job up from a checkpoint is relatively quick.
@@ -30,42 +46,47 @@
 
 {subsection: Using +WantFTOnCheckpoint}
 
-Given the following toy self-checkpointing script,
+Given the following toy Python script:
 
-{file: vuc.sh}
-#!/bin/bash
+{file: vuc.py}
+#!/usr/bin/env python
 
-VALUE=0
-if [ -e vuc.ckpt ]; then
-    for value in `cat vuc.ckpt`; do
-        VALUE=$value
-        break
-    done
-fi
-
-while [ ${VALUE} -lt 10 ]; do
-    echo "Computing timestep ${VALUE}..."
-    sleep 5
-    VALUE=$((VALUE+1))
-    echo ${VALUE} > vuc.ckpt
-    if [ $((VALUE%3)) -eq 0 ]; then
-        exit 17
-    fi
-done
+import sys
+import time
 
-exit 0
+value = 0
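+# Resume from the checkpoint file, if one is present.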
+try:
+    with open('vuc.ckpt', 'r') as f:
+        value = int(f.read())
+except IOError:
+    pass
+
+print("Starting from {0}".format(value))
+for i in range(value, 10):
+    print("Computing timestep {0}".format(value))
+    time.sleep(1)
+    value += 1
+    with open('vuc.ckpt', 'w') as f:
+        f.write("{0}".format(value))
+    if value % 3 == 0:
+        sys.exit(77)
+
+print("Computation complete")
+sys.exit(0)
 {endfile}
 
 the following submit file will transfer the file 'vuc.ckpt' back to the submit node at timesteps 3, 6, and 9.  If interrupted, it will resume from the most recent of those checkpoints.
 
 {file: vuc.submit}
-executable                  = vuc.sh
+# These first three lines are the magic ones.
+when_to_transfer_output     = ON_EXIT_OR_EVICT
++WantFTOnCheckpoint         = TRUE
++SuccessCheckpointExitCode  = 77
+
+executable                  = vuc.py
 arguments                   =
 
 should_transfer_files       = yes
-when_to_transfer_output     = ON_EXIT_OR_EVICT
-+WantFTOnCheckpoint         = TRUE
-+SuccessCheckpointExitCode  = 17
 transfer_output_files       = vuc.ckpt
 
 output                      = vuc.out
@@ -75,7 +96,7 @@
 queue 1
 {endfile}
 
-This script/submit file combination does not remove the "checkpoint file" generated for timestep 9 when the job completes.  This could be done in 'vuc.sh' immediately before it exits, but that would cause the final file transfer to fail.  The script could instead remove the file and then re-create it empty, it desired.
+This script/submit file combination does not remove the "checkpoint file" generated for timestep 9 when the job completes.  This could be done in 'vuc.py' immediately before it exits, but that would cause the final file transfer to fail, because 'vuc.ckpt' is listed in =transfer_output_files=.  The script could instead remove the file and then re-create it empty, if desired, as sketched below.
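+
+For example, the end of 'vuc.py' could be rewritten as follows.  This sketch truncates the checkpoint file instead of removing it, so the final file transfer still finds a 'vuc.ckpt' to send back:
+{file: vuc-cleanup.py}
+# Truncate, rather than remove, the checkpoint file, so that the final
+# transfer of transfer_output_files still succeeds.
+open('vuc.ckpt', 'w').close()
+
+print("Computation complete")
+sys.exit(0)
+{endfile}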
 
 {subsection: How Frequently to Checkpoint}
 
@@ -114,7 +135,10 @@
 1:  If your job can be made to exit after taking a checkpoint, but does not return a unique exit code when doing so, a wrapper script for the job may be able to inspect the sandbox after the job exits and determine if a checkpoint was successfully created.  If so, this wrapper script could then return a unique value.  If the job may exit with literally any code, so that no exit code is unique, HTCondor can instead regard being killed by a particular (unique) signal as the sign of a successful checkpoint; set =+SuccessCheckpointExitBySignal= to =TRUE= and =+SuccessCheckpointExitSignal= to the particular signal (see the submit-file fragment after this list).  If your job can not be made to exit after taking a checkpoint, a wrapper script may be able to determine when a successful checkpoint has been taken and kill the job.
 2:  If your job requires different arguments to start from a checkpoint, you can wrap your job in a script which checks for the presence of a checkpoint and runs the job with the corresponding arguments (see the wrapper sketch after this list).
 3:  If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
-4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.
+4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it's finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  The following Python script sketches this logic; the executable it runs, './77-ckpt', is hypothetical:
+{file: 77-atomic.py}
+#!/usr/bin/env python
+
+import os
+import shutil
+import subprocess
+import sys
+
+# Start the job from a copy of the atomically-updated checkpoint, so
+# that an interruption at a bad time can not corrupt it.
+if os.path.exists('checkpoint.atomic'):
+    shutil.copy('checkpoint.atomic', 'checkpoint')
+
+# './77-ckpt' stands in for a job which writes its checkpoint
+# incrementally to 'checkpoint' and exits with code 17 when it has
+# finished doing so.
+rv = subprocess.call(['./77-ckpt'])
+
+# Because rename is atomic, 'checkpoint.atomic' is always a complete
+# and consistent checkpoint, even if this script is interrupted here.
+if rv == 17:
+    os.rename('checkpoint', 'checkpoint.atomic')
+
+# Exit with the job's code, so that setting +SuccessCheckpointExitCode
+# to 17 and listing 'checkpoint.atomic' in transfer_output_files work
+# as described above.
+sys.exit(rv)
+{endfile}
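+
+As mentioned in item 1, a job which can not reserve a unique exit code may be able to use a unique signal instead.  The following submit-file fragment is a sketch, assuming a job which kills itself with SIGUSR2 (signal 12 on Linux) after taking a checkpoint:
+{file: signal.submit}
+# Treat death by SIGUSR2 as a successful checkpoint.
+when_to_transfer_output        = ON_EXIT_OR_EVICT
++WantFTOnCheckpoint            = TRUE
++SuccessCheckpointExitBySignal = TRUE
++SuccessCheckpointExitSignal   = 12
+{endfile}
+
+Likewise, for item 2, the following Python wrapper is a sketch, assuming a hypothetical executable './my-job' which writes its checkpoint to 'my-job.ckpt' and needs a '--resume' flag to start from it:
+{file: resume-wrapper.py}
+#!/usr/bin/env python
+
+import os
+import subprocess
+import sys
+
+# Pass the resume flag only if a checkpoint has actually been written.
+args = ['./my-job']
+if os.path.exists('my-job.ckpt'):
+    args += ['--resume', 'my-job.ckpt']
+
+# Exit with the job's code, so HTCondor still sees its checkpoint code.
+sys.exit(subprocess.call(args))
+{endfile}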
 
 Future versions of HTCondor may remove the requirement that the job set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax the atomic-update requirement (item 4, above); the job would only have to ensure that its checkpoint is complete and consistent (if stored in multiple files) when it exits.  (HTCondor does not partially update the sandbox stored in the spool: either every file successfully transfers back, or none of them do.)