 Presently, "success" is defined to be "exit with code 0"; this will change in a future revision.
+[In that revision, "success" will be defined by three job ad attributes: CheckpointExitBySignal, CheckpointExitSignal, and CheckpointExitCode. If the first is true, then "success" occurs when the process exits on signal CheckpointExitSignal. Otherwise, "success" occurs when the process exits with code CheckpointExitCode.]
+
+[In a further revision, you may also (instead?) set '+WantFTOnCheckpoint = TRUE'. If during the course of normal execution the job exits with "success", as defined above, file transfer will occur as if the job were being evicted.]
+
 Presently, intermediate file transfer is defined to be identical to file transfer on an eviction, which means that specifying a transfer_output_files list that doesn't include the checkpoint files will break things (the job will restart from scratch if rescheduled on another machine). This also implies that you'll only want to use this feature on a small scale for now -- the checkpoints (which may include the entire sandbox) are uploaded to the schedd's spool directory, which is probably unwise in terms of both space and time.

 {section: Configuration}
@@ -16,6 +20,8 @@
 In the job ad, set '+WantCheckpointSignal = TRUE'. You may also set '+CheckpointSig' to the signal you want sent. The signal will otherwise default to the soft kill signal (which defaults to SIGTERM).

+If you also want to checkpoint on eviction, set 'when_to_transfer_output = ON_EXIT_OR_EVICT' in the job ad. You may also want to set 'kill_sig' to match '+CheckpointSig', so that when it's evicted your job receives its checkpoint signal rather than its soft-kill signal (usually SIGTERM).
+
 {section: Random Thoughts}

 The code currently doesn't update the job ad on a successful checkpoint; it also does not correctly tell the shadow (and therefore the schedd or job log) if the job successfully checkpointed (it will always say it didn't).
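
The proposed "success" rule built on CheckpointExitBySignal, CheckpointExitSignal, and CheckpointExitCode can be sketched as a small decision function. This is only an illustration of the rule as described above, not HTCondor code: the function name `checkpoint_succeeded`, the plain-dict job ad, and the defaults (no signal-based exit, exit code 0 to match the present definition) are all assumptions.

```python
def checkpoint_succeeded(job_ad, exited_by_signal, exit_signal=None, exit_code=None):
    """Return True if the exit counts as a successful checkpoint.

    job_ad is a plain dict standing in for the job ad (an assumption for
    this sketch). The default values below are also assumptions.
    """
    if job_ad.get("CheckpointExitBySignal", False):
        expected_signal = job_ad.get("CheckpointExitSignal")
        if expected_signal is None:
            # No signal configured, so no exit can match it.
            return False
        # Success only if the process died on exactly the expected signal.
        return exited_by_signal and exit_signal == expected_signal
    # Otherwise success means a normal exit with the expected code.
    expected_code = job_ad.get("CheckpointExitCode", 0)
    return (not exited_by_signal) and exit_code == expected_code


# Hypothetical job ad: checkpoint is signalled by exiting on signal 10.
ad = {"CheckpointExitBySignal": True, "CheckpointExitSignal": 10}
print(checkpoint_succeeded(ad, exited_by_signal=True, exit_signal=10))
print(checkpoint_succeeded(ad, exited_by_signal=False, exit_code=0))
```

With an empty job ad the sketch falls back to the present behaviour, where only a normal exit with code 0 counts as success.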