*: {link:https://condor-wiki.cs.wisc.edu/index.cgi/search?s=dmtcp&t=1 Tickets about DMTCP}
*: Releases are now in /p/condor/public/binaries/contrib, named dmtcp_condor_integration-*-Any-Any.tar.gz, where * is the version number.
*: The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with
{code}
git clone /p/condor/repository/dmtcp_condor.git/
{endcode}
*: When making a new release, you need to change the version number inside shim_dmtcp.
*: DMTCP has a build option for Condor. By default, DMTCP tries to checkpoint sockets. When built for Condor, DMTCP behaves like Condor's checkpointing support: checkpoint attempts are delayed until all sockets are closed.

General system:

*: The user submits a job. Notable changes to their submit file are:
{code}
# Submit shim_dmtcp instead of the normal job.
executable=shim_dmtcp

# These are IN ADDITION to the user's "real" binary and input files!
transfer_input_files = dmtcp_checkpoint,dmtcp_coordinator,\
    dmtcp_command,dmtcp_restart,dmtcphijack.so,libmtcp.so,\
    mtcp_restart

# Argument   Meaning
# --log      log file name for actions in shim_dmtcp script,
#            if n/a use /dev/null
# --stdin    stdin file, if n/a use /dev/null
# --stdout   stdout file, if n/a use /dev/null
# --stderr   stderr file, if n/a use /dev/null
# --ckptint  checkpointing interval in seconds
# 1          the executable name you should have transferred in
# 2+         arguments to the executable
#
# Note that the stdout/stderr files are for output from the "real"
# binaries. The normal output/error will exclusively be
# messages from shim_dmtcp and the DMTCP tools.
arguments = --log shim_dmtcp.$(CLUSTER).$(PROCESS).log \
    --stdin foo.py \
    --stdout job.$(CLUSTER).$(PROCESS).out \
    --stderr job.$(CLUSTER).$(PROCESS).err \
    --ckptint 1800 \
    ./REAL_BINARY example-argument-one example-argument-two

# These are all required by DMTCP. JALIB is an internal DMTCP
# library ("Jason's library"). If your job needs more
# environment options set, just append them.
environment=DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;\
    DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)

# On kill, tell our shim to checkpoint. You can change this, but you
# will need to change shim_dmtcp as well.
kill_sig = 2

# If your pool isn't homogeneous (nearly identical distributions
# and updates), your checkpoints may not be portable. The exact
# options needed aren't yet known, but these may work. Note that
# you'll need to identify the exact values yourself; these won't
# work for you!
Requirements = \
    (CheckpointPlatform == "LINUX INTEL 2.6.x normal 0x40000000" \
    && OSKernelRelease == "2.6.18-128.el5")
{endcode}
*: The shim starts up the dmtcp_coordinator, which is instructed to checkpoint at regular intervals.
*: The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
*: Whe

Contents of the DMTCP Condor integration repository (and tarballs at the moment):

*: =Makefile= - Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies between distributions.
*: testing and development tools
*:: =foo.py= - Python script used for testing Python under DMTCP
*:: =foo.c= - C program used for testing under DMTCP
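
The submit-file changes listed under "General system" are mechanical, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the integration repository: the function names =shim_arguments= and =submit_lines= are hypothetical, and the defaults (/dev/null for unused files, 1800 seconds for the interval) are simply taken from the example submit file above. It only assembles text and does not submit anything.

```python
# Hypothetical helper (not part of the DMTCP Condor integration repository):
# assembles the shim_dmtcp-related submit-file lines described on this page.

# DMTCP files the page lists for transfer_input_files. These are in addition
# to the user's own binary and input files.
DMTCP_FILES = [
    "dmtcp_checkpoint", "dmtcp_coordinator", "dmtcp_command",
    "dmtcp_restart", "dmtcphijack.so", "libmtcp.so", "mtcp_restart",
]


def shim_arguments(executable, exe_args=(), log="/dev/null",
                   stdin="/dev/null", stdout="/dev/null",
                   stderr="/dev/null", ckptint=1800):
    """Return the value for the submit file's `arguments` line.

    Option names follow the table in the example submit file; the
    defaults are assumptions based on that example.
    """
    parts = [
        "--log", log,
        "--stdin", stdin,
        "--stdout", stdout,
        "--stderr", stderr,
        "--ckptint", str(ckptint),
        executable,  # positional 1: the "real" binary
    ]
    parts.extend(exe_args)  # positional 2+: arguments to the real binary
    return " ".join(parts)


def submit_lines(executable, exe_args=(), extra_inputs=(), **shim_opts):
    """Return the shim-related submit-file lines as a list of strings."""
    inputs = DMTCP_FILES + list(extra_inputs)
    return [
        "executable = shim_dmtcp",
        "transfer_input_files = " + ",".join(inputs),
        "arguments = " + shim_arguments(executable, exe_args, **shim_opts),
        "environment = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;"
        "DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)",
        "kill_sig = 2",  # must match the signal shim_dmtcp expects
    ]


if __name__ == "__main__":
    print("\n".join(submit_lines(
        "./REAL_BINARY",
        ["example-argument-one", "example-argument-two"],
        extra_inputs=["REAL_BINARY"],
        log="shim_dmtcp.$(CLUSTER).$(PROCESS).log",
        stdout="job.$(CLUSTER).$(PROCESS).out",
        stderr="job.$(CLUSTER).$(PROCESS).err",
    )))
```

The output still needs the usual job-specific lines (universe, queue, and, for non-homogeneous pools, a Requirements expression with values you have identified yourself).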