See also StandardUniverse

Usage

Starting up

Start your program like so:

setarch x86_64 -L -B -R ./a.out -_condor_ckpt ckpt

Replace a.out with your program name.

Replace "x86_64" with "i386" if you're on a 32-bit system. The -B option (which limits the process to 32-bit addresses) may not be needed in every case, but the circumstances in which it can be omitted are not known. You can add arguments to the program itself before or after the arguments above.

Replace ckpt with the file name you want your checkpoint written into. You can omit the -_condor_ckpt argument and its file name entirely, in which case the checkpoint file is your program name with .ckpt appended.
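
As a minimal sketch of the substitutions described above, the following POSIX shell helpers pick the setarch architecture from a machine type and print the resulting command line. The names ckpt_arch and ckpt_start_cmd are illustrative and not part of HTCondor, and the list of machine types is an assumption that covers only common x86 values of "uname -m".

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of HTCondor.

# Map a machine type (as printed by `uname -m`) to a setarch architecture.
ckpt_arch() {
    case "$1" in
        x86_64|amd64) echo x86_64 ;;
        i386|i486|i586|i686) echo i386 ;;
        *) echo "unsupported machine type: $1" >&2; return 1 ;;
    esac
}

# Print the start-up command line for a program and an optional checkpoint
# file. Without a checkpoint file the -_condor_ckpt argument is left out, so
# the program uses its default of <program>.ckpt.
ckpt_start_cmd() {
    arch=$(ckpt_arch "$1") || return 1
    prog=$2
    ckpt=${3:-}
    if [ -n "$ckpt" ]; then
        echo "setarch $arch -L -B -R ./$prog -_condor_ckpt $ckpt"
    else
        echo "setarch $arch -L -B -R ./$prog"
    fi
}

# Example: ckpt_start_cmd "$(uname -m)" a.out ckpt
```

The helpers only print the command rather than running it, which makes it easy to check the result before launching a long job.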

Checkpointing

To checkpoint and exit, send SIGTSTP. To checkpoint and continue, send SIGUSR2.

Checkpoints should be atomic; they are written to ckpt.tmp and moved to ckpt upon completion.

When checkpointing and exiting, the process is terminated by the signal SIGUSR2 (not SIGTSTP, which is the signal you sent it).
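
A caller can tell a checkpoint-and-exit apart from other terminations by inspecting the wait status. The sketch below assumes a POSIX shell, where a child killed by a signal yields a status above 128 and "kill -l <status>" prints the signal name. The helper name ckpt_exit_kind is hypothetical.

```shell
#!/bin/sh
# Classify a wait status (hypothetical helper, not part of HTCondor):
#   "checkpoint"     the process was killed by SIGUSR2
#   "signal:<NAME>"  the process was killed by another signal
#   "exit:<code>"    the process exited normally
# Limitation: the shell cannot distinguish a normal exit code above 128
# from death by signal, so such codes are reported as signals.
ckpt_exit_kind() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$(kill -l "$status" 2>/dev/null) || sig=unknown
        case "$sig" in
            USR2|SIGUSR2) echo checkpoint ;;
            *) echo "signal:$sig" ;;
        esac
    else
        echo "exit:$status"
    fi
}

# Usage with a running job whose PID is in $pid:
#   kill -TSTP "$pid"              # checkpoint and exit
#   wait "$pid"; ckpt_exit_kind $?
```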

Resuming a checkpoint

To resume from a checkpoint, do this:

setarch x86_64 -L -B -R ./a.out -_condor_restart ckpt -_condor_relocatable

This is the same command line as before, but -_condor_ckpt is now -_condor_restart and -_condor_relocatable has been added.

Checkpointing in HTCondor's vanilla universe

  1. Register a signal handler for SIGTERM; it should send SIGTSTP to the running "real" job.
  2. Register a signal handler for SIGTSTP; it should send SIGTSTP to the running "real" job. (This will probably be unused, but is included to offer an option that is more similar to the standard universe.)
  3. Register a signal handler for SIGUSR2; it should send SIGUSR2 to the running "real" job. (As of July 2012, this is never used, but we anticipate adding support for periodic checkpoints in the vanilla universe.)
  4. Is the checkpoint file ("my-real-job.ckpt") present? Then the arguments are "-_condor_restart my-real-job.ckpt -_condor_relocatable". Otherwise the arguments are "-_condor_ckpt my-real-job.ckpt".
  5. Start your "real" job
    • Start it under setarch to disable address randomization and similar features that HTCondor checkpointing cannot cope with. The "-B" option (limiting memory to 32-bit addresses) may not be necessary in all cases, but the specific circumstances are not known. The <arguments> are as determined in step 4.
      • 32-bit: setarch i386 -L -B -R ./my-real-job <arguments>
      • 64-bit: setarch x86_64 -L -B -R ./my-real-job <arguments>
    • Note the PID, for use in the signal handlers mentioned above.
  6. Wait for your "real" job to exit.
    • If the job was terminated by SIGUSR2 (the signal it exits with after checkpointing, even though SIGTSTP is what it was sent), then this is a checkpoint.

In pseudocode:

# Terminal checkpoint:
register signal SIGTERM handler:
        kill -TSTP $PID

# Terminal checkpoint:
register signal SIGTSTP handler:
        kill -TSTP $PID


# Periodic checkpoint:
register signal SIGUSR2 handler:
        kill -USR2 $PID

# Modern address randomization and related features cause us grief. Disable them.
$ARCH = i386 (or x86_64)
$COMPAT = setarch $ARCH -L -B -R

if $CKPT_FILENAME is NOT present
    $COMPAT ./$PROGRAM -_condor_ckpt $CKPT_FILENAME
    note the $PID
else
    $COMPAT ./$PROGRAM -_condor_restart $CKPT_FILENAME -_condor_relocatable
    note the $PID

wait for $PID to exit
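
The pseudocode above could be made concrete along the following lines. This is a sketch in POSIX shell under stated assumptions, not the attached checkpoint-wrapper: the function name run_with_ckpt and the COMPAT override variable are hypothetical, and the default assumes a 64-bit system.

```shell
#!/bin/sh
# Sketch of a vanilla-universe checkpoint wrapper (hypothetical).
# Usage: run_with_ckpt ./my-real-job my-real-job.ckpt
# Set COMPAT to override the setarch prefix (assumed default: x86_64).
run_with_ckpt() {
    program=$1
    ckpt=$2
    compat=${COMPAT-setarch x86_64 -L -B -R}
    pid=
    got_signal=0

    # Restart from the checkpoint if it exists, otherwise start fresh.
    if [ -e "$ckpt" ]; then
        set -- -_condor_restart "$ckpt" -_condor_relocatable
    else
        set -- -_condor_ckpt "$ckpt"
    fi

    # Forward signals to the real job once its PID is known.
    # SIGTERM and SIGTSTP request a terminal checkpoint, SIGUSR2 a periodic one.
    trap 'got_signal=1; [ -n "$pid" ] && kill -TSTP "$pid"' TERM TSTP
    trap 'got_signal=1; [ -n "$pid" ] && kill -USR2 "$pid"' USR2

    # $compat is deliberately unquoted so that it splits into words.
    # "<&0" keeps the wrapper's stdin; without job control a background
    # command would otherwise read from /dev/null.
    $compat "$program" "$@" <&0 &
    pid=$!

    # A trapped signal interrupts wait; if the job is still there, wait again.
    while :; do
        got_signal=0
        status=0
        wait "$pid" || status=$?
        if [ "$got_signal" = 1 ] && kill -0 "$pid" 2>/dev/null; then
            continue
        fi
        break
    done

    trap - TERM TSTP USR2
    return "$status"
}
```

The function returns the job's wait status, so the caller can decide whether the job finished or was checkpointed. Because setarch replaces itself with the program it launches, the PID recorded here is expected to be the PID of the real job.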

Working Perl code: See checkpoint-wrapper, an attachment to this page.

Command line

A program that has been condor_compiled takes additional command line options. Generally speaking, end users shouldn't need to know about them; HTCondor passes them automatically. But for testing, or for checkpointing without the standard universe, they can be useful. These options are officially undocumented and may change without warning!

Command file

You can pass a command file in through a file descriptor (-_condor_cmd_fd) or a file (-_condor_cmd_file). The command file can set several options. Merely specifying a command file also enables remote syscalls.

Commands in the file are one per line. The recognized commands are
