These are notes on checkpointing options, both those that exist for HTCondor and those in other systems:

*: HTCondor's standard universe: Links in a custom library that replaces system calls with wrapped versions so that state can be saved; checkpoints (and restores) inside a signal handler.
*: {link: http://dmtcp.sourceforge.net/ DMTCP (Distributed MultiThreaded CheckPointing)} (see DmtcpCondor): Works in user space without kernel changes, by preloading a wrapper library (via LD_PRELOAD) into the target processes rather than by using ptrace.
*: {link: http://criu.org CRIU (Checkpoint/Restore In Userspace)} (see also {link: https://lwn.net/Articles/525675/ this LWN article}): Requires some kernel changes.
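
To make the "checkpoints inside a signal handler" idea from the standard universe entry concrete, here is a rough sketch of signal-triggered checkpointing. It is not how the standard universe is implemented: that saves the whole process image from a linked-in library, while this sketch only pickles a small piece of application state. The class name =CheckpointedSum=, the checkpoint file path, and the choice of SIGUSR1 are all assumptions made for illustration (SIGUSR1 also makes this Unix-only).

```python
import os
import pickle
import signal


class CheckpointedSum:
    """Sums 0..limit-1 and can save/restore its progress.

    Hypothetical example class; not part of HTCondor, DMTCP, or CRIU.
    """

    def __init__(self, path):
        self.path = path          # where the checkpoint file is written
        self.state = (0, 0)       # (next_i, total), always replaced as a whole

    def write_checkpoint(self, signum=None, frame=None):
        """Signal handler: write the current state atomically."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(self.state, f)
        # Rename so a reader never sees a half-written checkpoint.
        os.replace(tmp_path, self.path)

    def restore(self):
        """Load a saved state if one exists; return True if restored."""
        if not os.path.exists(self.path):
            return False
        with open(self.path, "rb") as f:
            self.state = pickle.load(f)
        return True

    def run(self, limit, checkpoint_at=None):
        """Continue the sum from self.state up to limit.

        If checkpoint_at is given, the process signals itself once when
        next_i reaches that value, standing in for an external request.
        """
        while self.state[0] < limit:
            next_i, total = self.state
            # Single rebinding: the handler sees the old or new tuple,
            # never a mix of the two.
            self.state = (next_i + 1, total + next_i)
            if checkpoint_at is not None and self.state[0] == checkpoint_at:
                os.kill(os.getpid(), signal.SIGUSR1)
        return self.state[1]
```

To use it, create a =CheckpointedSum=, register its =write_checkpoint= method with =signal.signal(signal.SIGUSR1, ...)=, and call =restore()= before =run()=; a restarted process then resumes from the last saved tuple instead of from zero. Keeping the state in one tuple is a workaround that a full process-image checkpointer does not need, since it captures everything at the instant the handler runs.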