*: The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with {code}git clone /p/condor/repository/dmtcp_condor.git/{endcode}
*: When making a new release, you need to change the version number inside of shim_dmtcp.
-*: DMTCP has a build option for Condor. The normal behavior is to try and checkpoint sockets. When built for Condor, DMTCP behaves like Condor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.
-*: Some changes to shim_dmtcp come from Michael Hanke, the Debian maintainer for Condor (among other packages)
+*: DMTCP has a build option for HTCondor. The normal behavior is to try and checkpoint sockets. When built for HTCondor, DMTCP behaves like HTCondor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.
+*: Some changes to shim_dmtcp come from Michael Hanke, the Debian maintainer for HTCondor (among other packages)

Open tasks:
*: How can we identify platforms that a checkpoint is compatible with? Machine ClassAd attributes of possible interest include CheckpointPlatform and OSKernelRelease.
@@ -14,7 +14,7 @@
*: At the moment the script just waits 3 seconds for dmtcp_command to finish whatever it was doing, be it checkpointing or terminating the running processes. (This is the checkpoint function in shim_dmtcp, look for calls to delay.) dmtcp_command returns immediately, but the actual work may take longer. psilord feels 3 seconds is extremely reliable, but it's not guaranteed. Ideal solution: see if dmtcp_command has a --block-until-work-is-really-done option. If not, ask for one upstream.
*: shim_dmtcp line 217 "ckptsig=`${CONDOR_PATH}/bin/condor_status -l $host 2>&1"..., don't bother with egreping, just use -format!
*: shim_dmtcp line 280 "/bin/sleep 2" change to "delay 2"
-*: The dmtcp_restart_script.sh symlink renaming code around line 252 in shim_dmtcp exists to work around perceived limitations in Condor. dmtcp_restart_script.sh is a symlink, and psilord thinks Condor does something terrible, like returning it as a 0-length file. (A copy would be fine!) Verify claim. If true, report bug to Condor. If not true and hasn't been true for "a while", remove the code.
+*: The dmtcp_restart_script.sh symlink renaming code around line 252 in shim_dmtcp exists to work around perceived limitations in HTCondor. dmtcp_restart_script.sh is a symlink, and psilord thinks HTCondor does something terrible, like returning it as a 0-length file. (A copy would be fine!) Verify claim. If true, report bug to HTCondor. If not true and hasn't been true for "a while", remove the code.
*: shim_dmtcp line 328, the "Close the gcb" section, is probably moot in the new GCB free world and can be deleted.
*: shim_dmtcp line 355 (logit "could not start dmtcp_coordinator"): should it "exit 1"? Should we continue? What will happen?
*: shim_dmtcp line 350 (port=`${dmtcp_coordinator} --port 0 --exit-on-last --interval ${CKPTINT} --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`) - We just threw away potentially useful output from the dmtcp_coordinator! Write output and error to a temporary file, grep the port out of that, dump contents of temporary file to stdout so we have it for debugging. Also: ask upstream to add a --port-file option that will write the port and nothing else to a file, to simplify extracting it.
@@ -87,7 +87,7 @@
*: The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
*: When restarting from a checkpoint, shim_dmtcp will notice the dmtcp_restart_script.sh created by DMTCP and will run that instead.

-Contents of the DMTCP Condor integration repository (and tarballs at the moment):
+Contents of the DMTCP HTCondor integration repository (and tarballs at the moment):
*: =shim_dmtcp= - The heart of our system. Submit this instead of your job, passing suitable options, to get checkpointing behavior.
*: =job.sub= - Example submit file using shim_dmtcp; used as part of manual testing.
*: =Makefile= - Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies in different distributions.
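
The open task for shim_dmtcp line 350 (keep the coordinator's output instead of discarding it) could be approached roughly as follows. This is only a sketch: =extract_port= and =coord_log= are hypothetical names, and the sample coordinator output is made up so the snippet runs without DMTCP installed. It assumes only what the existing grep/sed pipeline already assumes, namely that the output has a line containing "Port:" followed by the number.

```shell
#!/bin/sh
# Hypothetical helper: pull the port number out of saved coordinator output.
# Assumes a line containing "Port:" followed by the port number.
extract_port() {
    # $1: file holding the coordinator's combined stdout and stderr
    grep "Port:" "$1" | head -n 1 | sed -e 's/.*Port://' -e 's/[^0-9]//g'
}

coord_log=$(mktemp) || exit 1

# Stand-in for the real invocation (the text below is invented sample output).
# In shim_dmtcp, the existing dmtcp_coordinator command line would go here,
# redirected with:  > "$coord_log" 2>&1
printf 'dmtcp_coordinator starting...\n    Port: 7779\n    Checkpoint Interval: 600\n' > "$coord_log"

port=$(extract_port "$coord_log")

# Dump the full output so it is available for debugging, then clean up.
cat "$coord_log"
rm -f "$coord_log"

if [ -z "$port" ]; then
    echo "could not determine dmtcp_coordinator port" >&2
else
    echo "coordinator port: $port"
fi
```

Stripping everything that is not a digit, rather than only spaces and tabs, sidesteps differences in how sed implementations treat =\t=. Whether an empty port should lead to "exit 1" is the separate open question noted for line 355.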