{subsection: How Frequently to Checkpoint}
 
-FIXME
+As a ballpark, your job should aim to checkpoint about once an hour.  However, the best interval depends on a number of factors:
+
+*:  How frequently your job "naturally" checkpoints.  For instance, a particular simulation may compute a time step roughly every five minutes on an average machine.  If the simulator can be configured to write a checkpoint every ten time-steps, that leaves the simulator and HTCondor ten minutes out of each hour to write and then transfer the checkpoint.  (Of course, if it actually takes that long, your job will spend a sixth of its time checkpointing, which may be too slow; see below.)  If instead each job is a batch of smaller tasks (e.g., run six different computations on the same input data), it may be most convenient to checkpoint after each computation completes, as in the sketch after this list.  If that takes about forty-five minutes for some computations and ninety for others, that's fine.  (A job which writes checkpoints frequently doesn't have to exit that often, but the details of writing or altering a script to do so are beyond the scope of this HOWTO.)
+*:  How frequently your job is interrupted.  For instance, if the max job lifetime on your pool is six hours, taking a checkpoint every three hours could easily result in losing just under three hours of work.
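+
+For the batch-of-tasks case, the following is a minimal sketch of a wrapper script; the task programs, file names, and the exit code are all hypothetical stand-ins, so use whatever exit code the rest of your setup treats as "took a checkpoint".
+
+{verbatim}
+#!/bin/bash
+# Hypothetical wrapper: run six computations on the same input data, one
+# per execution; the completed output files serve as the checkpoint.
+for N in 1 2 3 4 5 6; do
+    if [ ! -e output_${N}.dat ]; then
+        # Write to a temporary file first so that a partial output is
+        # never mistaken for a completed one.
+        ./computation_${N} input.dat > output_${N}.tmp || exit 1
+        mv output_${N}.tmp output_${N}.dat
+        # One computation done; exit so the checkpoint can be handled.
+        # 85 is only a placeholder for your setup's "checkpointed, but
+        # not finished" code.
+        exit 85
+    fi
+done
+# Every computation has finished.
+exit 0
+{endverbatim}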
 
 {subsection: Debugging Checkpoints}
 
@@ -66,3 +69,7 @@
 Instead of taking a checkpoint at some fixed interval, it is possible, for some specific interruptions, to take a checkpoint when the interruption happens.  Specifically, if your execution resources are generally reliable, and your job's checkpoints are both quick to take and small, your job may be able to generate a checkpoint, and transfer it back to the submit node, at the time your job is preempted.  This works as described in the previous section, except that you set =when_to_transfer_output= to =ON_EXIT_OR_EVICT= and =KillSig= to the particular signal your job expects, and the signal is only sent when your job is preempted.  The administrator of the execute machine determines the maximum amount of time a job is allowed to run after receiving its =KillSig=; a job may request a longer delay than the machine's default by setting =JobMaxVacateTime= (but this will be capped by the administrator's setting).
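+
+As a minimal sketch, assuming (hypothetically) that your job treats =SIGUSR2= as its "write a checkpoint and exit" signal and needs up to five minutes to do so, the relevant submit-file lines might look like this:
+
+{verbatim}
+when_to_transfer_output = ON_EXIT_OR_EVICT
+
+# The signal this job interprets as "checkpoint now"; SIGUSR2 is only an
+# example, use whatever signal your code actually handles.
+kill_sig = SIGUSR2
+
+# Request up to 300 seconds to write and transfer the checkpoint after
+# receiving the signal; the execute machine's administrator may cap this.
+job_max_vacate_time = 300
+{endverbatim}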
 
 You should probably only use this method of operation if your job runs on an HTCondor pool too old to support =+WantFTOnCheckpoint=, or the pool administrator has disallowed use of the feature (because it can be resource-intensive).
+
+{subsubsection: Early Checkpoint Exits}
+
+If your job's natural checkpoint interval is half or more of your pool's max job runtime, it may make sense to checkpoint and then immediately ask to be rescheduled, rather than lowering your user priority by doing work you know will be thrown away.  In this case, you can use the =OnExitRemove= job attribute to determine whether your job should leave the queue or be rescheduled after exiting.  Don't set =when_to_transfer_output= to =ON_EXIT_OR_EVICT=, and don't set =+WantFTOnCheckpoint=; just have the job exit with a unique code after taking its checkpoint, and write =OnExitRemove= so that it evaluates to false for that code (see the sketch below).
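+
+A minimal sketch of the corresponding submit-file line, assuming (hypothetically) that your job exits with code 85 after taking a checkpoint and with code 0 when it has entirely finished:
+
+{verbatim}
+# Leave the queue on any exit except the hypothetical "I took a checkpoint"
+# code of 85; exiting with 85 puts the job back in the queue to be
+# rescheduled.
+on_exit_remove = (ExitBySignal == True) || (ExitCode != 85)
+{endverbatim}
+
+The =ExitBySignal= check keeps the expression defined even when the job is killed by a signal, in which case =ExitCode= is not set.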