*:  How long it takes to take and transfer checkpoints.  Measuring the first may be difficult (how do you know when the job started taking a checkpoint?); the second, in practice, can only be measured experimentally.  (HTCondor versions 8.9.1 and later record file transfer events in the user job log (the =log= submit command), including file transfers at checkpoint time, so this duration should be easy to determine; see the submit-file fragment after this list.)  Unfortunately, the longer checkpoints take, the less frequently you generally want to take them, so as to maintain efficient job progress.  However, the longer the interval between checkpoints, the more progress the job will lose when interrupted.  The timing of interruptions isn't (in general) predictable; in practice, it varies from pool to pool, and can only be predicted on the assumption that it will resemble past experience.  The big exception to this is noted above: max job runtimes.
 *:  Your appetite for deadline risk versus your desire for fast turnaround.  Generally, the longer you go between checkpoints, the sooner the job will complete, because taking and transferring checkpoints takes time.  On the other hand, if the job is interrupted, you'll lose more progress.
 
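+To get that measurement, make sure the job has a user log.  The following submit-file fragment is a minimal sketch (the log file name is a placeholder); with it, HTCondor writes a timestamped event to =checkpoint-test.log= for each file transfer, so the checkpoint transfer duration can be read directly from the log.
+
+{verbatim}
+# Record job events, including file-transfer events, in this file.
+log = checkpoint-test.log
+{endverbatim}
+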
-{subsection: Debugging Checkpoints}
+{subsection: Debugging Self-Checkpointing Jobs}
 
-FIXME  (Use =condor_vacate_job=, =condor_hold=, and =condor_transfer_data=.)
+Because a job may be interrupted at any time, you can test a self-checkpointing job by interrupting it at an arbitrary point and checking that a valid checkpoint is transferred.  To do so, use =condor_vacate_job= to evict the job.  When the job has finished transferring its checkpoint (watch the user log), use =condor_hold= to put it on hold, so that it can't restart while you're examining the checkpoint (and potentially overwrite it).  Finally, use =condor_transfer_data= to obtain the checkpoint file(s) themselves.
+
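+For example, assuming the job in question is cluster 1234, proc 0, and that its submit file specified =log = checkpoint-test.log= (the job ID and file name here are placeholders), the following sequence is a sketch of the procedure described above:
+
+{verbatim}
+# Evict the job; this should trigger a (final) checkpoint and its transfer.
+condor_vacate_job 1234.0
+
+# Watch the user log until the file-transfer and eviction events appear.
+tail -f checkpoint-test.log
+
+# Prevent the job from restarting (and overwriting the checkpoint).
+condor_hold 1234.0
+
+# Copy the job's spooled files, including the checkpoint, back to the submit directory.
+condor_transfer_data 1234.0
+{endverbatim}
+
+Once you're satisfied with the checkpoint, =condor_release= lets the job run again.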
 
 {subsection: Working Around the Assumptions}