{subsection: Debugging Self-Checkpointing Jobs}

Because a job may be interrupted at any time, you can debug a self-checkpointing job by interrupting it at an arbitrary point and checking that a valid checkpoint is transferred. To do so, use =condor_vacate_job= to evict the job. When it's done transferring (watch the user log), use =condor_hold= to put the job on hold, so that it can't restart while you're looking at the checkpoint (and potentially overwrite it). Finally, to obtain the checkpoint file(s) themselves, use =condor_config_val= to ask where SPOOL is, and then look in the appropriate subdirectory.

For example, if your job is ID 635.0, and is logging to the file 'job.log', you can copy the files in the checkpoint to a subdirectory of the current directory as follows:

{code}
condor_vacate_job 635.0
# Wait for the job to finish being evicted; hit CTRL-C when you see 'Job was evicted.'
tail --follow job.log
condor_hold 635.0
# Copy the checkpoint files from the spool.
# Note that _condor_stderr and _condor_stdout are the files corresponding to the job's
# output and error submit commands; they aren't named correctly until the job finishes.
cp -a `condor_config_val SPOOL`/635/0/cluster635.proc0.subproc0 .
# Then examine the checkpoint files to see if they look right.
# ...
# When you're done, release the job to see if it actually works right.
condor_release 635.0
# You can use condor_ssh_to_job to examine the job while it runs.
condor_ssh_to_job 635.0
{endcode}

{subsection: Working Around the Assumptions}