{code}git clone /p/condor/repository/dmtcp_condor.git/{endcode}
 *: When making a new release, you need to change the version number inside of shim_dmtcp.
 *: DMTCP has a build option for Condor. The normal behavior is to try and checkpoint sockets. When built for Condor, DMTCP behaves like Condor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.
+*: Some changes to shim_dmtcp come from Michael Hanke, the Debian maintainer for Condor (among other packages).
+
+Open tasks:
+*: How can we identify platforms that a checkpoint is compatible with? Machine ClassAd attributes of possible interest include CheckpointPlatform and OSKernelRelease.
+*: Should the signal to trap in shim_dmtcp be overridable via a command line option? Are there any circumstances in which we care?
+*: The "normal" standard output and error should be from the real job, while output from shim_dmtcp and the related tools should go into a user-specified file. At the moment the situation is backward. Note that if shim_dmtcp utterly fails, it should report _something_ to the main stderr.
+*: At the moment the script just waits 3 seconds for dmtcp_command to finish whatever it was doing, be it checkpointing or terminating the running processes. (This is the checkpoint function in shim_dmtcp; look for calls to delay.) dmtcp_command returns immediately, but the actual work may take longer. psilord feels 3 seconds is extremely reliable, but it's not guaranteed. Ideal solution: see if dmtcp_command has a --block-until-work-is-really-done option. If not, ask for one upstream.
+*: shim_dmtcp line 217 "ckptsig=`${CONDOR_PATH}/bin/condor_status -l $host 2>&1"..., don't bother with egreping, just use -format!
+*: shim_dmtcp line 280 "/bin/sleep 2": change to "delay 2".
+*: The dmtcp_restart_script.sh symlink renaming code around line 252 in shim_dmtcp exists to work around perceived limitations in Condor. dmtcp_restart_script.sh is a symlink, and psilord thinks Condor does something terrible, like returning it as a 0-length file. (A copy would be fine!) Verify claim. If true, report bug to Condor. If not true and hasn't been true for "a while", remove the code.
+*: shim_dmtcp line 328, the "Close the gcb" section, is probably moot in the new GCB-free world and can be deleted.
+*: shim_dmtcp line 355 (logit "could not start dmtcp_coordinator"): should it "exit 1"? Should we continue? What will happen?
+*: shim_dmtcp line 350 (port=`${dmtcp_coordinator} --port 0 --exit-on-last --interval ${CKPTINT} --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`) - We just threw away potentially useful output from the dmtcp_coordinator! Write output and error to a temporary file, grep the port out of that, dump contents of temporary file to stdout so we have it for debugging. Also: ask upstream to add a --port-file option that will write the port and nothing else to a file, to simplify extracting it.
+*: shim_dmtcp line 377 (${dmtcp_checkpoint} --port $port -j "$@" <$STDIN 1>$STDOUT 2>$STDERR &) - Why background this? psilord believes that it is necessary for our signal trap to work. Verify. If true, add a comment noting that.
+*: shim_dmtcp line 383 (if [ $ret -gt 128 ] ; then) - The purpose of this code block is not clear. Something about handling the user executable exiting with a signal. Needs investigation.
+
+*: Improve output messages in shim_dmtcp:
+*:: line 109 "echo "Terminating..." >&2". Should clarify that it's an internal error, and that getopt failed.
+*:: line 128 "*) echo "Internal error! ($1)"; exit 1;;" - Not an internal error; invalid usage/bad arguments.
+*:: line 133 "printf "Need at least one argument.\n\n"" - append ", the program to run."
+*: Comment typos
+*:: shim_dmtcp line 195 "# Try to idenitify"..., spell identify.
+
+
+
 General system:
 *: The user submits a job. Notable changes to their submit file are:
@@ -59,10 +84,12 @@
 *: The shim starts up the dmtcp_coordinator. It is given instructions to checkpoint at regular intervals.
 *: The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
-*: Whe
+*: When restarting from a checkpoint, shim_dmtcp will notice the dmtcp_restart_script.sh created by DMTCP and will run that instead.
 
 Contents of the DMTCP Condor integration repository (and tarballs at the moment):
+*: =shim_dmtcp= - The heart of our system. Submit this instead of your job, passing suitable options, to get checkpointing behavior.
+*: =job.sub= - Example submit file using shim_dmtcp; used as part of manual testing.
 *: =Makefile= - Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies in different distributions.
 *: testing and development tools
 *:: =foo.py= - Python script used for testing Python under DMTCP
-*:: =foo.c=- C program used for testing Python under DMTCP
+*:: =foo.c= - C program used for testing Python under DMTCP
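
For the open task about shim_dmtcp line 350 (capturing the coordinator output instead of discarding it), one possible shape is sketched below. This is not the code in shim_dmtcp. The function names extract_port, start_coordinator and fake_coordinator are hypothetical, and the sketch assumes only what the existing pipeline already assumes: that the coordinator prints a line containing "Port:" followed by the port number. fake_coordinator is a stand-in so the sketch runs without DMTCP installed.

```shell
#!/bin/sh
# Sketch only: names and output format are assumptions, see the note above.

# extract_port FILE: print the digits following "Port:" on the first
# matching line of FILE; prints nothing if no such line exists.
extract_port() {
    grep 'Port:' "$1" | head -n 1 | sed -e 's/.*Port://' -e 's/[^0-9]//g'
}

# start_coordinator CMD [ARGS...]: run CMD with stdout and stderr captured
# in a temporary file, replay that file on stdout for debugging, and leave
# the parsed port in the shell variable $port.
# Returns non-zero if no port could be found.
start_coordinator() {
    log=$(mktemp) || return 1
    "$@" >"$log" 2>&1
    cat "$log"                    # keep the full output for debugging
    port=$(extract_port "$log")
    rm -f "$log"
    [ -n "$port" ]
}

# Hypothetical stand-in for the real coordinator command line.
fake_coordinator() {
    echo "coordinator starting"
    echo "    Port: 7779"
    echo "some warning" >&2
}

if start_coordinator fake_coordinator; then
    echo "coordinator port: $port"
else
    echo "could not determine coordinator port" >&2
fi
```

In shim_dmtcp the call would presumably look like start_coordinator ${dmtcp_coordinator} --port 0 --exit-on-last --interval ${CKPTINT} --background, with the failure branch tying into the line 355 question about whether to "exit 1".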