{subsection: Debugging Self-Checkpointing Jobs}

Because a job may be interrupted at any time, you can debug a self-checkpointing job by interrupting it at an arbitrary point and checking that a valid checkpoint is transferred. To do so, use =condor_vacate_job= to evict the job. When it's done transferring (watch the user log), use =condor_hold= to put the job on hold, so that it can't restart while you're looking at the checkpoint (and potentially overwrite it). Finally, to obtain the checkpoint file(s) themselves, use =condor_config_val= to ask where SPOOL is, and then look in the appropriate subdirectory.

For example, if your job is ID 635.0, and is logging to the file 'job.log', you can copy the files in the checkpoint to a subdirectory of the current directory as follows:

{code}
condor_vacate_job 635.0
# Wait for the job to finish being evicted; hit CTRL-C when you see 'Job was evicted.'
tail --follow job.log
condor_hold 635.0
# Copy the checkpoint files from the spool.
# Note that _condor_stderr and _condor_stdout are the files corresponding to the job's
# output and error submit commands; they aren't named correctly until the job finishes.
cp -a `condor_config_val SPOOL`/635/0/cluster635.proc0.subproc0 .
# Then examine the checkpoint files to see if they look right.
# ...
# When you're done, release the job to see if it actually works right.
condor_release 635.0
# You can use condor_ssh_to_job to examine the job while it runs.
condor_ssh_to_job 635.0
{endcode}

{subsection: Working Around the Assumptions}