- Tickets about DMTCP
- Releases are now in /p/condor/public/binaries/contrib, named dmtcp_condor_integration-*-Any-Any.tar.gz where * is the version number.
- The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with
git clone /p/condor/repository/dmtcp_condor.git/
- When making a new release, you need to change the version number inside of shim_dmtcp.
- DMTCP has a build option for Condor. By default, DMTCP tries to checkpoint open sockets. When built for Condor, DMTCP behaves like Condor's own checkpointing support: checkpoint attempts are delayed until all sockets are closed.
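The "delay until all sockets are closed" behavior can be illustrated with a toy model. This is only a sketch of the idea, not DMTCP's or Condor's actual implementation; the class name `DeferredCheckpointer` and its methods are hypothetical and exist only for this example.

```python
class DeferredCheckpointer:
    """Toy model: a checkpoint request waits until no sockets are open.

    Hypothetical illustration only; not taken from DMTCP or Condor.
    """

    def __init__(self, do_checkpoint):
        self._do_checkpoint = do_checkpoint  # callback that "takes" the checkpoint
        self._open_sockets = 0               # number of sockets currently open
        self._pending = False                # True while a request is waiting

    def socket_opened(self):
        self._open_sockets += 1

    def socket_closed(self):
        if self._open_sockets == 0:
            raise RuntimeError("socket_closed() called with no open sockets")
        self._open_sockets -= 1
        self._maybe_checkpoint()

    def request_checkpoint(self):
        # Remember the request; it fires now only if no sockets are open.
        self._pending = True
        self._maybe_checkpoint()

    def _maybe_checkpoint(self):
        if self._pending and self._open_sockets == 0:
            self._pending = False
            self._do_checkpoint()
```

In this model a request made while two sockets are open does nothing until the second `socket_closed()` call, at which point the callback runs once.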
General system:
- The user submits a job. Notable changes to their submit file are:
# Submit shim_dmtcp instead of the normal job.
executable=shim_dmtcp

# These are IN ADDITION to the user's "real" binary and input files!
transfer_input_files = dmtcp_checkpoint,dmtcp_coordinator,\
    dmtcp_command,dmtcp_restart,dmtcphijack.so,libmtcp.so,\
    mtcp_restart

# Argument   Meaning
# --log      log file name for actions in shim_dmtcp script,
#            if n/a use /dev/null
# --stdin    stdin file, if n/a use /dev/null
# --stdout   stdout file, if n/a use /dev/null
# --stderr   stderr file, if n/a use /dev/null
# --ckptint  checkpointing interval in seconds
# 1          the executable name you should have transferred in
# 2+         arguments to the executable
#
# Note that stdout/stderr files are for output from the "real"
# binaries. The normal output/error will exclusively be
# messages from shim_dmtcp and the DMTCP tools.
arguments = --log shim_dmtcp.$(CLUSTER).$(PROCESS).log \
    --stdin foo.py \
    --stdout job.$(CLUSTER).$(PROCESS).out \
    --stderr job.$(CLUSTER).$(PROCESS).err \
    --ckptint 1800 \
    ./REAL_BINARY example-argument-one example-argument-two

# These are all required by DMTCP. JALIB is an internal DMTCP
# library ("Jason's library"). If your job needs more
# environment options set, just append them.
environment=DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;\
DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# On kill, tell our shim to checkpoint. You can change this, but will
# need to change shim_dmtcp as well.
kill_sig = 2

# If your pool isn't homogeneous (nearly identical distributions
# and updates), your checkpoints may not be portable. The exact
# options needed aren't yet known, but these may work. Note that
# you'll need to identify the exact values yourself; these won't
# work for you!
Requirements = \
    (CheckpointPlatform == "LINUX INTEL 2.6.x normal 0x40000000"\
    && OSKernelRelease == "2.6.18-128.el5")
- The shim starts up the dmtcp_coordinator. It is given instructions to checkpoint at regular intervals.
- The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
- Whe
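The argument layout documented in the submit file comments (four file options, a checkpoint interval, then the real executable and its arguments) can be sketched as a small parser. This is an illustration of the interface only, not the actual shim_dmtcp code. The function name `parse_shim_args`, the `/dev/null` defaults, and treating `--ckptint` as required are assumptions made for this example.

```python
import argparse


def parse_shim_args(argv):
    """Parse a shim_dmtcp-style argument list (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="shim_dmtcp")

    # File options. Defaulting to /dev/null is an assumption; the submit
    # file comments only say to pass /dev/null when a file is not needed.
    for name in ("log", "stdin", "stdout", "stderr"):
        parser.add_argument("--" + name, default="/dev/null")

    # Checkpointing interval in seconds (required here by assumption).
    parser.add_argument("--ckptint", type=int, required=True)

    # Positional 1: the real executable; positional 2+: its arguments.
    parser.add_argument("executable")
    parser.add_argument("job_args", nargs=argparse.REMAINDER)

    return parser.parse_args(argv)


if __name__ == "__main__":
    example = [
        "--log", "shim_dmtcp.log",
        "--stdin", "foo.py",
        "--stdout", "job.out",
        "--stderr", "job.err",
        "--ckptint", "1800",
        "./REAL_BINARY", "example-argument-one", "example-argument-two",
    ]
    args = parse_shim_args(example)
    print(args.executable, args.job_args, args.ckptint)
```

Using `argparse.REMAINDER` for the trailing arguments keeps everything after the executable name together, so the job's own arguments are passed through rather than being interpreted by the shim.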
Contents of the DMTCP Condor integration repository (and tarballs at the moment):
Makefile
- Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies between distributions.
- testing and development tools
foo.py
- Python script used for testing Python under DMTCP
foo.c
- C program used for testing Python under DMTCP