Presently, "success" is defined to be "exit with code 0"; this will change in a future revision.
-[In that revision, "success" will be defined by three job ad attributes: CheckpointExitBySignal, CheckpointExitSignal, and CheckpointExitCode. If the first is true, then "success" occurs when the process exits on signal CheckpointExitSignal. Otherwise, "success" occurs when the process exits with code CheckpointExitCode.]
+In 8.5.3, "success" will be defined by three job ad attributes: CheckpointExitBySignal, CheckpointExitSignal, and CheckpointExitCode. If the first is true, then "success" occurs when the process exits on signal CheckpointExitSignal. Otherwise, "success" occurs when the process exits with code CheckpointExitCode.
(As this is an experimental feature, you'll need to prepend a '+' to each of the preceding attributes when you use them in a submit file.)

-[In another further revision, you may also (instead?) set '+WantFTOnCheckpoint = TRUE'. If during the course of normal execution the job exits with "success", as defined above, file transfer will occur as if the job is being evicted.]
+In 8.5.3, you may also set '+WantFTOnCheckpoint = TRUE'. If during the course of normal execution the job exits with "success", as defined above, file transfer will occur as if the job is being evicted.

Presently, intermediate file transfer is defined to be identical to file transfer on an eviction, which means that specifying a transfer_output_files list that doesn't include the checkpoint files will break things (the job will restart from scratch if rescheduled on another machine). This also implies that you'll only want to use this feature on a small scale for now -- the checkpoints (which may include the entire sandbox) are uploaded to the schedd's spool directory, which is probably unwise in terms of both space and time.
@@ -25,3 +25,5 @@
{section: Random Thoughts}

The code currently doesn't update the job ad on a successful checkpoint; it also does not correctly tell the shadow (and therefore the schedd or job log) if the job successfully checkpointed (it will always say it didn't).
+
+The remote usage accounting is probably entirely wrong.
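
The definition of "success" given for 8.5.3 can be sketched in a few lines of Python. This is only an illustration of the rule as described in the text, not the actual HTCondor code: the function name checkpoint_succeeded, its parameters, and the use of a plain dict for the job ad are all hypothetical, and falling back to exit code 0 when CheckpointExitCode is unset is an assumption that mirrors the present behavior.

```python
def checkpoint_succeeded(job_ad, exited_by_signal, exit_value):
    """Return True if a process exit counts as a successful checkpoint.

    job_ad           -- dict standing in for the job ad (hypothetical)
    exited_by_signal -- True if the process was terminated by a signal
    exit_value       -- the signal number or the exit code, respectively
    """
    if job_ad.get("CheckpointExitBySignal", False):
        # Success means dying on exactly the configured signal.
        return bool(exited_by_signal) and exit_value == job_ad.get("CheckpointExitSignal")
    # Otherwise success means a normal exit with the configured code.
    # Assumption: default to 0 when the attribute is absent.
    return (not exited_by_signal) and exit_value == job_ad.get("CheckpointExitCode", 0)
```

For example, with a job ad of {"CheckpointExitCode": 85}, a normal exit with code 85 counts as a checkpoint, while an exit with code 0 does not.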