*: Releases are now in /p/condor/public/binaries/contrib, named dmtcp_condor_integration-*-Any-Any.tar.gz where * is the version number.
*: The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with {code}git clone /p/condor/repository/dmtcp_condor.git/{endcode}
*: When making a new release, you need to change the version number inside shim_dmtcp.
*: DMTCP has a build option for Condor. The normal behavior is to try to checkpoint sockets. When built for Condor, DMTCP behaves like Condor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.

General system:
*: The user submits a job. Notable changes to their submit file are:
{code}
# Submit shim_dmtcp instead of the normal job.
executable=shim_dmtcp

# These are IN ADDITION to the user's "real" binary and input files!
transfer_input_files = dmtcp_checkpoint,dmtcp_coordinator,\
    dmtcp_command,dmtcp_restart,dmtcphijack.so,libmtcp.so,\
    mtcp_restart

# Argument    Meaning
# --log       log file name for actions in shim_dmtcp script,
#             if n/a use /dev/null
# --stdin     stdin file, if n/a use /dev/null
# --stdout    stdout file, if n/a use /dev/null
# --stderr    stderr file, if n/a use /dev/null
# --ckptint   checkpointing interval in seconds
# 1           the executable name you should have transferred in
# 2+          arguments to the executable
#
# Note that the stdout/stderr files are for output from the "real"
# binaries. The normal output/error will exclusively be
# messages from shim_dmtcp and the DMTCP tools.
arguments = --log shim_dmtcp.$(CLUSTER).$(PROCESS).log \
    --stdin foo.py \
    --stdout job.$(CLUSTER).$(PROCESS).out \
    --stderr job.$(CLUSTER).$(PROCESS).err \
    --ckptint 1800 \
    ./REAL_BINARY example-argument-one example-argument-two

# These are all required by DMTCP. JALIB is an internal DMTCP
# library ("Jason's library"). If your job needs more
# environment options set, just append them.
environment=DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;\
    DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# On kill, tell our shim to checkpoint. You can change this, but you
# will need to change shim_dmtcp as well.
kill_sig = 2

# If your pool isn't homogeneous (nearly identical distributions
# and updates), your checkpoints may not be portable. The exact
# options needed aren't yet known, but these may work. Note that
# you'll need to identify the exact values yourself; these won't
# work for you!
Requirements = \
    (CheckpointPlatform == "LINUX INTEL 2.6.x normal 0x40000000"\
    && OSKernelRelease == "2.6.18-128.el5")
{endcode}

*: The shim starts up the dmtcp_coordinator. It is given instructions to checkpoint at regular intervals.
*: The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
*: Whe

Contents of the DMTCP Condor integration repository (and tarballs at the moment):
*: =Makefile= - Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies between distributions.
*: testing and development tools
*:: =foo.py= - Python script used for testing Python under DMTCP
*:: =foo.c= - C program used for testing under DMTCP
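
To make the shim_dmtcp argument layout from the submit file above concrete (named options first, then the executable, then its arguments), here is a minimal Python sketch of how such an argument line could be split. This is not the shim_dmtcp implementation, whose internals are not shown on this page. The function name =parse_shim_args= is hypothetical, and defaulting the file options to /dev/null and leaving =--ckptint= optional are assumptions made for illustration.

```python
import argparse


def parse_shim_args(argv):
    """Split a shim_dmtcp-style argument list into options and the real job.

    Hypothetical helper for illustration only; it mirrors the argument
    table in the submit file comments, not the actual shim_dmtcp code.
    """
    parser = argparse.ArgumentParser(prog="shim_dmtcp")
    # Assumption: an omitted file option falls back to /dev/null, matching
    # the "if n/a use /dev/null" advice in the submit file comments.
    for option in ("--log", "--stdin", "--stdout", "--stderr"):
        parser.add_argument(option, default="/dev/null")
    # Checkpointing interval in seconds (assumed optional here).
    parser.add_argument("--ckptint", type=int, default=None)
    # Positional 1: the executable that was transferred in.
    parser.add_argument("executable")
    # Positional 2+: everything else is passed through to the executable.
    parser.add_argument("job_args", nargs=argparse.REMAINDER)
    return parser.parse_args(argv)


if __name__ == "__main__":
    parsed = parse_shim_args([
        "--log", "shim_dmtcp.log",
        "--stdin", "foo.py",
        "--stdout", "job.out",
        "--stderr", "job.err",
        "--ckptint", "1800",
        "./REAL_BINARY", "example-argument-one", "example-argument-two",
    ])
    print(parsed.executable, parsed.job_args, parsed.ckptint)
```

Collecting the trailing arguments as a remainder means options intended for the real binary are passed through untouched rather than being interpreted as shim options.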