*: User documentation: DmtcpCondor
*: {link:https://condor-wiki.cs.wisc.edu/index.cgi/search?s=dmtcp&t=1 Tickets about DMTCP}
*: Releases are now in /p/condor/public/binaries/contrib, named dmtcp_condor_integration-*-Any-Any.tar.gz where * is the version number.
*: The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with
{code}git clone /p/condor/repository/dmtcp_condor.git/{endcode}
*: When making a new release, you need to change the version number inside of shim_dmtcp.
*: DMTCP has a build option for HTCondor.  The normal behavior is to try to checkpoint sockets.  When built for HTCondor, DMTCP behaves like HTCondor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.
*: Some changes to shim_dmtcp come from Michael Hanke, the Debian maintainer for HTCondor (among other packages).

Open tasks:
*: How can we identify platforms that a checkpoint is compatible with?  Machine ClassAd attributes of possible interest include CheckpointPlatform and OSKernelRelease.
*: Should the signal to trap in shim_dmtcp be overridable via a command line option?  Are there any circumstances in which we care?
*: The "normal" standard output and error should come from the real job, and output from shim_dmtcp and the related tools should go into a user-specified file.  At the moment the situation is backward.  Note that if shim_dmtcp utterly fails, it should still report _something_ to the main stderr.
*: At the moment the script just waits 3 seconds for dmtcp_command to finish whatever it was doing, be it checkpointing or terminating the running processes.  (This is the checkpoint function in shim_dmtcp, look for calls to delay.)  dmtcp_command returns immediately, but the actual work may take longer.  psilord feels 3 seconds is extremely reliable, but it's not guaranteed.  Ideal solution: see if dmtcp_command has a --block-until-work-is-really-done option.  If not, ask for one upstream.
*: shim_dmtcp line 217 "ckptsig=`${CONDOR_PATH}/bin/condor_status -l $host 2>&1"..., don't bother piping through egrep; just use the -format option!
*: shim_dmtcp line 280 "/bin/sleep 2": change to "delay 2".
*: The dmtcp_restart_script.sh symlink renaming code around line 252 in shim_dmtcp exists to work around perceived limitations in HTCondor. dmtcp_restart_script.sh is a symlink, and psilord thinks HTCondor does something terrible, like returning it as a 0-length file.  (A copy would be fine!) Verify claim.  If true, report bug to HTCondor.  If not true and hasn't been true for "a while", remove the code.
*: shim_dmtcp line 328, the "Close the gcb" section, is probably moot in the new GCB-free world and can be deleted.
*: shim_dmtcp line 355 (logit "could not start dmtcp_coordinator"): should it "exit 1"?  Should we continue? What will happen?
*: shim_dmtcp line 350 (port=`${dmtcp_coordinator} --port 0 --exit-on-last --interval ${CKPTINT} --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`) - We just threw away potentially useful output from the dmtcp_coordinator!  Write output and error to a temporary file, grep the port out of that, dump contents of temporary file to stdout so we have it for debugging.  Also: ask upstream to add a --port-file option that will write the port and nothing else to a file, to simplify extracting it.
*: shim_dmtcp line 377 (${dmtcp_checkpoint} --port $port -j "$@" <$STDIN 1>$STDOUT 2>$STDERR &) - Why background this?  psilord believes that it is necessary for our signal trap to work. Verify.  If true, add a comment noting that.
*: shim_dmtcp line 383 (if [ $ret -gt 128 ] ; then) - The purpose of this code block is not clear.  Something about handling the user executable exiting with a signal.  Needs investigation.
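
For the coordinator start-up task (shim_dmtcp line 350), one possible shape is sketched below. It keeps everything the coordinator prints in a temporary file, pulls the port out of that file, and then replays the file on stdout for debugging. The helper names extract_port and start_coordinator are hypothetical, and the only assumption about the coordinator's output is the "Port:" line that the existing grep already relies on.

```shell
#!/bin/sh
# Sketch only; extract_port and start_coordinator are hypothetical helper
# names, not functions that exist in shim_dmtcp.

# extract_port FILE
# Print whatever follows "Port:" on the first matching line of FILE,
# with all whitespace removed.
extract_port() {
    grep 'Port:' "$1" | head -n 1 | sed -e 's/.*Port://' -e 's/[[:space:]]//g'
}

# start_coordinator CMD [ARGS...]
# Run CMD with stdout and stderr captured in a temporary file, set the
# global variable "port" from that output, then replay the captured
# output on stdout. Returns non-zero if no port was found.
start_coordinator() {
    coord_out=$(mktemp) || return 1
    "$@" >"$coord_out" 2>&1
    port=$(extract_port "$coord_out")
    cat "$coord_out"
    rm -f "$coord_out"
    [ -n "$port" ]
}

# Possible use, with the arguments taken from the existing line 350:
#   start_coordinator "${dmtcp_coordinator}" --port 0 --exit-on-last \
#       --interval "${CKPTINT}" --background \
#       || logit "could not start dmtcp_coordinator"
```

Because the port is returned through a variable rather than stdout, the full coordinator output stays available in the job's output file.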

*: Improve output messages in shim_dmtcp:
*:: line 109 "echo "Terminating..." >&2".  Should clarify that it's an internal error, and that getopt failed
*:: line 128 "*) echo "Internal error! ($1)"; exit 1;;" - Not an internal error; invalid usage/bad arguments.
*:: line 133 "printf "Need at least one argument.\n\n"" - append ", the program to run."
*: Comment typos
*:: shim_dmtcp line 195 "# Try to idenitify"...: spell "identify" correctly.
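
The message changes listed above could look roughly like the following. The helper names are hypothetical, and sending all three messages to stderr is an assumption; the current script may print some of them on stdout.

```shell
#!/bin/sh
# Sketch only; these helper names are hypothetical and do not exist in
# shim_dmtcp. Each one prints the revised wording to stderr.

# For line 109: make clear that getopt itself failed.
report_getopt_failure() {
    echo "Internal error: getopt failed. Terminating..." >&2
    return 1
}

# For line 128: an unexpected argument is a usage problem,
# not an internal error.
report_bad_argument() {
    echo "Invalid usage: unexpected argument ($1)" >&2
    return 1
}

# For line 133: say which argument is missing.
# Call as: require_program "$@"
require_program() {
    if [ "$#" -lt 1 ]; then
        printf 'Need at least one argument, the program to run.\n\n' >&2
        return 1
    fi
    return 0
}
```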


General system:
*: The user submits a job.  Notable changes to their submit file are:
{code}
# Submit shim_dmtcp instead of the normal job executable.
executable=shim_dmtcp

# These are IN ADDITION to the user's "real" binary and input files!
transfer_input_files = dmtcp_checkpoint,dmtcp_coordinator,\
    dmtcp_command,dmtcp_restart,dmtcphijack.so,libmtcp.so,\
    mtcp_restart

# Argument  Meaning
# --log     log file name for actions in shim_dmtcp script,
#           if n/a use /dev/null
# --stdin   stdin file, if n/a use /dev/null
# --stdout  stdout file, if n/a use /dev/null
# --stderr  stderr file, if n/a use /dev/null
# --ckptint checkpointing interval in seconds
# 1         the executable name you should have transferred in
# 2+        arguments to the executable
#
# Note that stdout/stderr files are for output from the "real"
# binaries.  The normal output/error will exclusively be
# messages from shim_dmtcp and the DMTCP tools.
arguments = --log shim_dmtcp.$(CLUSTER).$(PROCESS).log \
    --stdin foo.py \
    --stdout job.$(CLUSTER).$(PROCESS).out \
    --stderr job.$(CLUSTER).$(PROCESS).err \
    --ckptint 1800 \
    ./REAL_BINARY example-argument-one example-argument-two


# These are all required by DMTCP. JALIB is an internal DMTCP
# library ("Jason's library").  If your job needs more
# environment options set, just append them.
environment=DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;\
    DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# On kill, tell our shim to checkpoint. You can change this, but will
# need to change shim_dmtcp as well
kill_sig = 2

# If your pool isn't homogeneous (nearly identical distributions
# and updates), your checkpoints may not be portable.  The exact
# options needed aren't yet known, but these may work.  Note that
# you'll need to identify the exact values yourself; these won't
# work for you!
Requirements = \
    (CheckpointPlatform == "LINUX INTEL 2.6.x normal 0x40000000"\
    && OSKernelRelease == "2.6.18-128.el5")
{endcode}

*: The shim starts up the dmtcp_coordinator.  It is given instructions to checkpoint at regular intervals.
*: The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
*: When restarting from a checkpoint, shim_dmtcp will notice the dmtcp_restart_script.sh created by DMTCP and will run that instead.
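
The signal handling in this flow (kill_sig = 2 in the submit file, and the backgrounded job at shim_dmtcp line 377) leans on standard shell behavior: a trap handler cannot run while the shell is blocked on a foreground command, but wait on a background job returns right away with a status above 128 when a trapped signal arrives. The sketch below shows only that mechanism. The names run_job and on_checkpoint_signal are hypothetical, and the handler merely sets a flag where the real shim would presumably ask DMTCP to checkpoint.

```shell
#!/bin/sh
# Sketch only; run_job and on_checkpoint_signal are hypothetical names,
# and this is not the code in shim_dmtcp.

got_signal=0

on_checkpoint_signal() {
    # The real shim would presumably trigger a DMTCP checkpoint here.
    got_signal=1
}

# run_job SIGNAL CMD [ARGS...]
# Start CMD in the background, then wait for it with a trap on SIGNAL.
# Returns the status reported by the first wait.
run_job() {
    sig=$1
    shift
    got_signal=0
    trap on_checkpoint_signal "$sig"

    "$@" &
    job_pid=$!

    # Unlike a foreground command, wait is interrupted by a trapped
    # signal: it returns a status above 128 and the handler then runs.
    wait "$job_pid"
    ret=$?

    if [ "$got_signal" -eq 1 ]; then
        # The job may still be running; stop it and reap it.
        kill "$job_pid" 2>/dev/null || true
        wait "$job_pid" 2>/dev/null || true
    fi

    trap - "$sig"
    return "$ret"
}

# Possible use: run_job INT ./REAL_BINARY example-argument-one
```

Note that a wait status above 128 can mean either that the job itself died from a signal or that wait was interrupted by the trap, which may be worth keeping in mind when investigating the check at line 383.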

Contents of the DMTCP HTCondor integration repository (and tarballs at the moment):
*: =shim_dmtcp= - The heart of our system. Submit this instead of your job, passing suitable options, to get checkpointing behavior.
*: =job.sub= - Example submit file using shim_dmtcp; used as part of manual testing.
*: =Makefile= - Really just for development use. Collects DMTCP files and builds/submits a little test program.  Note that the path to mtcp_restart varies in different distributions.
*: testing and development tools
*:: =foo.py= - Python script used for testing Python under DMTCP
*:: =foo.c= - C program used for testing C under DMTCP