3:  If it takes too long to start your job up from a checkpoint, you will need to take checkpoints less often to make quicker progress.  This, of course, increases the risk of losing a substantial amount of work when your job is interrupted.  See also the 'Delayed and Manual Transfers' section.
 4:  If a job does not update its checkpoints atomically, you can use a wrapper script to atomically update some other file(s) and treat those as the checkpoints.  More specifically, if your job writes a checkpoint incrementally to a file called 'checkpoint', but exits with code 17 when it has finished doing so, your wrapper script can check for exit code 17 and then rename 'checkpoint' to 'checkpoint.atomic'.  Because rename is an atomic operation, this will prevent HTCondor from transferring a partial checkpoint file, even if the job is interrupted in the middle of taking a checkpoint.  Your wrapper script will also have to copy 'checkpoint.atomic' to 'checkpoint' before starting the job, so that the job (a) uses the safe checkpoint file and (b) doesn't corrupt that checkpoint file if interrupted at a bad time.  The following Python script sketches this logic; it assumes that the job's own command line is passed as the wrapper's arguments:
 {file: 77-atomic.py}
-FIXME
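+#!/usr/bin/env python3
+
+# Sketch of the wrapper logic described in item 4 above.  Assumptions not
+# specified there: the job's own command line is passed as this wrapper's
+# arguments, and the job writes its checkpoint to the file 'checkpoint'
+# in the current working directory, exiting with code 17 after each
+# complete checkpoint.
+
+import os
+import shutil
+import subprocess
+import sys
+
+CHECKPOINT = 'checkpoint'
+SAFE_CHECKPOINT = 'checkpoint.atomic'
+CHECKPOINT_EXIT_CODE = 17
+
+
+def main(job_command):
+    if not job_command:
+        print("usage: 77-atomic.py <job> [arguments]", file=sys.stderr)
+        return 1
+
+    # Give the job the last complete checkpoint, if there is one.  Copy
+    # rather than rename, so an interruption at a bad time can not
+    # corrupt the safe copy.
+    if os.path.exists(SAFE_CHECKPOINT):
+        shutil.copy2(SAFE_CHECKPOINT, CHECKPOINT)
+
+    # Run the job.
+    exit_code = subprocess.call(job_command)
+
+    # The job exits with code 17 only after it has finished writing a
+    # checkpoint, so 'checkpoint' is complete; publish it with a rename,
+    # which is atomic and can not leave a partial file behind.
+    if exit_code == CHECKPOINT_EXIT_CODE:
+        os.rename(CHECKPOINT, SAFE_CHECKPOINT)
+
+    # Exit with the job's own code, so HTCondor sees the checkpoint code.
+    return exit_code
+
+
+if __name__ == '__main__':
+    sys.exit(main(sys.argv[1:]))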
+
 {endfile}
 
 Future versions of HTCondor may remove the requirement for the job to set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=.  Doing so would relax the requirement that checkpoints be updated atomically: the job would only have to ensure that its checkpoint was complete and consistent (if stored in multiple files) when it exited.  (HTCondor does not partially update the sandbox stored in spool: either every file successfully transfers back, or none of them do.)
@@ -156,10 +156,15 @@
 
 {subsubsection: Signals}
 
+If you're not already familiar with the programmer's use of the word 'signals', skip this section and the next.
+
 If your job cannot spontaneously exit with a unique exit code after taking a checkpoint, but can take a checkpoint when sent a particular signal and _then_ exit in a unique way, you may set =+WantCheckpointSignal= to =TRUE= and =+CheckpointSig= to the particular signal.  HTCondor will send this signal to the job at an interval set by the administrator of the execute machine; if the job exits as specified by the =SuccessCheckpoint= job attributes, its files will be transferred and the job restarted, as usual.
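+
+For illustration only (not part of this HOWTO), the following sketch shows a job which takes a checkpoint and exits when sent a signal; the signal (SIGUSR2), the exit code (17), and the file names are arbitrary choices for the example.
+
+{file: signal-sketch.py}
+#!/usr/bin/env python3
+
+# Hypothetical job that takes a checkpoint when sent SIGUSR2 and then
+# exits with the unique code 17; the signal, the exit code, and the
+# file names are example choices, not requirements.
+
+import os
+import signal
+import sys
+import time
+
+checkpoint_requested = False
+
+def note_checkpoint_request(sig, frame):
+    global checkpoint_requested
+    checkpoint_requested = True
+
+signal.signal(signal.SIGUSR2, note_checkpoint_request)
+
+# Resume from the previous checkpoint, if there is one.
+step = 0
+if os.path.exists('checkpoint'):
+    with open('checkpoint') as f:
+        step = int(f.read())
+
+while step < 1000:
+    time.sleep(1)  # stands in for one unit of real work
+    step += 1
+
+    if checkpoint_requested:
+        # Write the new checkpoint, then publish it with an atomic rename.
+        with open('checkpoint.new', 'w') as f:
+            f.write(str(step))
+        os.rename('checkpoint.new', 'checkpoint')
+        sys.exit(17)
+
+sys.exit(0)
+{endfile}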
 
 {subsubsection: Reactive Checkpoints}
 
+If you're not already familiar with the programmer's use of the word 'signals', skip this section.
+
 Instead of taking a checkpoint at some interval, it is possible, for some types of interruption, to take a checkpoint only when interrupted.  Specifically, if your execution resources are generally reliable, and your job's checkpoints are both quick to take and small, your job may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is evicted.  This works like the previous section, except that you set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal that causes your job to checkpoint, and the signal is only sent when your job is preempted.  The administrator of the execute machine determines the maximum amount of time a job is allowed to run after receiving its =KillSig=; a job may request a longer delay than the machine's default by setting =JobMaxVacateTime= (but this will be capped by the administrator's setting).
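+
+As a sketch only, a submit file for this mode of operation might include lines like the following; the signal name and the 300-second vacate time are arbitrary example values, and =kill_sig= and =job_max_vacate_time= are the submit commands which set =KillSig= and =JobMaxVacateTime=.
+
+{file: reactive-sketch.sub}
+# Example values only; see the text above for the corresponding job
+# ClassAd attributes.
+when_to_transfer_output = ON_EXIT_OR_EVICT
+kill_sig = SIGUSR2
+job_max_vacate_time = 300
+{endfile}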
 
 You should probably only use this method of operation if your job runs on an HTCondor pool too old to support =+WantFTOnCheckpoint=, or if the pool administrator has disallowed use of that feature (because it can be resource-intensive).