Warning: DMTCP 1.2.8 or later required
DMTCP 1.2.8 or later is required. Earlier versions of DMTCP do not work with version 0.6 and later of our shim script, and versions 0.5 and earlier of our shim script do not work reliably under HTCondor.
(Information on why this was necessary is at #3747)
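If you want to guard against an outdated DMTCP installation before submitting, the version requirement above can be checked mechanically. The following is only a minimal sketch: the helper names `parse_version` and `dmtcp_is_new_enough` are hypothetical, and it assumes you have already obtained the installed DMTCP version as a plain dotted string such as "1.2.8" (how you obtain it is not covered here).

```python
def parse_version(text):
    """Turn a dotted version string such as '1.2.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum DMTCP version required by shim script 0.6 and later.
MINIMUM_DMTCP = parse_version("1.2.8")


def dmtcp_is_new_enough(installed):
    """Return True if the installed version string meets the minimum.

    Tuples compare element by element, so '1.2.10' correctly counts
    as newer than '1.2.8' (a plain string comparison would get this wrong).
    """
    return parse_version(installed) >= MINIMUM_DMTCP


if __name__ == "__main__":
    for candidate in ("1.2.7", "1.2.8", "1.2.10"):
        print(candidate, dmtcp_is_new_enough(candidate))
```

Comparing integer tuples rather than raw strings avoids misordering multi-digit components.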
Warning: Pools with different CPUs
Moving jobs between different processors can cause the jobs to crash because of incompatibilities. For example, if your job checkpoints on a system that uses SSE4, glibc will cache that information and use SSE4-optimized code paths. If the job then moves to a machine lacking that support, it will crash the next time it calls into such a function (including many common string routines). We're working on an easier solution (see #3753), but for now, if you have a CPU-heterogeneous pool, you'll need to use your job's Requirements to check the CheckpointPlatform.
You can use
condor_status -af CheckpointPlatform | sort | uniq -c
to identify which machines you want to use. For example, you might decide that machines identified as "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2" are the ones you want to target, in which case you would use something like this in your submit file:
requirements = CheckpointPlatform == "LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2"
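The counting and the resulting submit-file line can also be produced by a small script. This is a rough sketch, not part of the integration package: it assumes the output of the condor_status command above has been captured as text with one platform string per line, the function names `count_platforms` and `requirements_for` are hypothetical, and the second platform string in the sample is made up for illustration.

```python
from collections import Counter


def count_platforms(status_output):
    """Count identical CheckpointPlatform strings, one per non-empty line."""
    lines = (line.strip() for line in status_output.splitlines())
    return Counter(line for line in lines if line)


def requirements_for(platform):
    """Build a submit-file requirements line matching one platform exactly."""
    return 'requirements = CheckpointPlatform == "{}"'.format(platform)


# Hypothetical sample of captured condor_status output (assumption:
# one CheckpointPlatform value per machine, one per line).
sample_output = """\
LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2
LINUX X86_64 2.6.x normal N/A ssse3
LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2
"""

counts = count_platforms(sample_output)
most_common_platform, machine_count = counts.most_common(1)[0]
print(machine_count, most_common_platform)
print(requirements_for(most_common_platform))
```

Picking the most common platform is only one possible policy; you may prefer to target whichever group of machines best suits your job.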
DMTCP is a third-party user-space checkpoint package available from here: http://dmtcp.sourceforge.net/
The current integration model of DMTCP with HTCondor is a shim script with additions to the submit description file for a vanilla universe job. This allows your unmodified vanilla universe job (for example, a dynamically linked executable, a process that forks, or a Matlab(tm) script) to checkpoint and migrate between machines.
The download tarball contains a README file with a manifest and more detailed instructions on how to use the integration software. This software is considered alpha, so the documentation and other things might be a little rough for a bit.
|Release||Date||DMTCP version||Notes|
|dmtcp_condor_integration-0.6||Jul 11th, 2013||1.2.8||Critical fixes|
|dmtcp_condor_integration-0.5||Mar 25th, 2013||1.2.7||Updates to support more recent releases of DMTCP|
|dmtcp_condor_integration-0.4||Nov 1st, 2011|
|dmtcp_condor_integration-0.3||Oct 13th, 2011|
|dmtcp_condor_integration-0.2||Aug 26th, 2011|
|dmtcp_condor_integration-0.1||May 25th, 2011|
Alan De Smet, an HTCondor project member, is the current maintainer of the DMTCP/HTCondor integration software. Inquiries, bug reports, or comments about the DMTCP/HTCondor integration package should be sent to condor-admin at cs.wisc.edu.
For help or bug reports relating to DMTCP itself, please subscribe to the dmtcp-forum mailing list located here: http://dmtcp.sourceforge.net/mailingLists.html
The homepage for DMTCP is: http://dmtcp.sourceforge.net
The homepage for HTCondor is: http://www.cs.wisc.edu/condor
The DMTCP/HTCondor integration software is released under the Apache 2.0 License.
People who have contributed to the source code of the DMTCP/HTCondor integration:
- Gene Cooperman
- Michael Hanke
See DmtcpCondorDev for notes on developing the DMTCP support in HTCondor.