Standard Universe

Notes from Peter Keller's 2011-11-30 brain-dump.

See also CondorSyscallLibCommandLine

libsyscall

The standard universe requires 'condorized' binaries, which are "statically-linked" binaries (see below for the standard universe definition thereof) created by condor_compile, which, despite the name, is a linker, rather than a compiler. It creates binaries with the API stack on the right.


normal                condorized
---------------       ------------------
| application |       |  application   |
|             |       |   |------------|
|             |       |   | libsyscall |
---------------       |------------|   |
|    libc     |       |    libc    |   |
---------------       ------------------
|   kernel    |       |   kernel       |
---------------       ------------------

The interposition layer, libsyscall, comes before libc on the link line; the application thus calls its functions in preference to libc's. (Note that libsyscall does not override every libc function; only those necessary for checkpointing and/or remote I/O.)

When starting a standard universe job, the standard universe-specific shadow establishes a connection with the startd. This connection is inherited by the standard universe-specific starter, which in turn bequeaths it to the application (specifically, libsyscall). The starter's connection is used for startup, shutdown, and suspend/resume. All other communication to the shadow comes from libsyscall.

There may be machinery in libsyscall to make it signal-atomic.

"Statically-Linked"

Because the OS-provided information on (in-memory) segment boundaries is frequently (but inconsistently) wrong, the checkpointer must guess them. Because the checkpointer runs in a signal handler, where recovering from a segfault is impractically tricky, it must guess them correctly. To make this possible, a standard universe application must keep all its volatile data on the stack or in the .data segment, and it must be laid out in memory as follows.


---------------
| environment |
|-------------| <- guessed; may be an unmapped page here
|  the stack  |
|             |
|--\/\/\/\/\/-| <- guessed
|             |
|-/\/\/\/\/\--| <- sbrk
|    .data    |
|-------------|
| .bss&.bzero |
|-------------| <- _data_start (linker symbol)
|    .text    |
---------------

This layout is known to Linux as the "vm compat" personality. To force an application to run this way, you can use the setarch program. (To run a standard universe program in stand-alone mode, setarch <arch> -L -B -R.) The standard universe starter will otherwise take care of this.

These general requirements have some specific consequences:

No calls to mmap().

Because the kernel lies about segment boundaries, we can't permit calls to mmap(). (With one exception: requests for anonymous segments can be satisfied by sbrk().) We can't just intercept calls to mmap() and record the results because the dynamic linker (ld.so) has its own mmap() implementation. This implies the following.
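The sbrk() exception above can be sketched as follows. This is a minimal illustration, not libsyscall's code: the function name sketch_anon_map() is hypothetical, and the page-alignment step is an assumption made so that the result resembles what mmap() would return.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes sbrk() when compiling in strict ISO C mode */
#endif
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: satisfy an anonymous-mapping request by growing
 * the heap with sbrk() instead of calling mmap(). Returns NULL on failure. */
void *sketch_anon_map(size_t length)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t brk_now, aligned, mask;
    size_t rounded;
    void *base;

    if (page <= 0 || length == 0) {
        return NULL;
    }
    mask = (uintptr_t)page - 1;

    /* Pad the break up to a page boundary (assumption: callers expect
     * page-aligned memory, as mmap() would give them). */
    brk_now = (uintptr_t)sbrk(0);
    aligned = (brk_now + mask) & ~mask;
    if (aligned != brk_now &&
        sbrk((intptr_t)(aligned - brk_now)) == (void *)-1) {
        return NULL;
    }

    /* Round the request up to whole pages and extend the break. */
    rounded = (length + (size_t)mask) & ~(size_t)mask;
    base = sbrk((intptr_t)rounded);
    if (base == (void *)-1) {
        return NULL;
    }
    return base;   /* previous break, which is the start of the new region */
}
```

Because the new region lies below the break, it falls inside the area the checkpointer already treats as heap, so no extra bookkeeping is needed for it.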

No dynamic linking.

Obvious, but also includes calls to dlopen(). This implies patches to Ulrich Drepper's glibc, because it likes to dlopen() libnss (to get different resolvers). The patches allow glibc to be built entirely statically, but necessitate our own copy.

The gcc and g++ runtimes must also be static.

This is implied by the above, but worth calling out because it can require you to recompile the compiler. Many distributions now have (optional) packages that include the static runtimes, instead.

"VM compat" layout.

As mentioned above, the application must be laid out in memory in a specific way; in particular, the .bss (and/or .bzero) segments must (directly) abut the .data segment. This used to always be true; hence the 'compat' in the personality name.

No VA randomization or ExecShield.

Both virtual address randomization and ExecShield (which used to be the same thing but apparently no longer are) screw up the required in-memory layout and cannot be used.

Consistent .vdso section address.

The .vdso segment -- used by Linux to speed system calls -- is not checkpointed (being kernel memory space) but must be mapped at the exact same address on restart.

Remote I/O

To perform remote I/O, libsyscall interposes itself between the application and libc. The set of functions in libsyscall which duplicate (part of) the libc API are called switches, and generally behave identically. (So much so that most of them can be automatically generated; see stubgen, below.) They're called switches because they switch between four different behaviours based on the global state. The global state can be either 'mapped' or 'unmapped' and either 'local' or 'remote'. The latter should be obvious. The former not only ensures that local and remote FDs don't collide, but simplifies checkpointing. (Rather than try to change FDs in the application, libsyscall remembers enough about all the FDs it opened on the application's behalf to reopen them when the application restarts. It might get different FDs, but it will ensure that the application never sees them.)

When a standard universe job calls, say, open(), libsyscall will start in mapped-remote mode. It will ask the virtual file table (see below) for the translated FD. The virtual file table returns the translated FD (and, if the job's submit file specified that the file in question ought to be stored locally, switches the mode to local). The open() switch then changes the mode to unmapped (having unmapped the FD) and calls itself again. This time, it will take the unmapped-remote path and make an RPC. Or, if the file was specified to be local, the unmapped-local path; the unmapped-local path makes the corresponding system call directly. The mode changes will be undone as the call-stack unwinds.
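The control flow just described can be sketched in a few lines of C. Everything here is a hypothetical stand-in: the names (sketch_open(), sketch_table_reserve(), and so on), the fake FD values, and the rule that paths under /tmp/ count as "local" are assumptions for illustration only, not libsyscall's actual code.

```c
#include <assert.h>
#include <string.h>

/* All names below are hypothetical stand-ins, not libsyscall's own. */
enum sketch_mapping  { SKETCH_MAPPED, SKETCH_UNMAPPED };
enum sketch_locality { SKETCH_LOCAL, SKETCH_REMOTE };

static enum sketch_mapping  g_mapping  = SKETCH_MAPPED;
static enum sketch_locality g_locality = SKETCH_REMOTE;

/* Records which leaf path the last call took, so the flow is observable. */
static const char *g_last_path = "none";

#define SKETCH_MAX_VFDS 16
static int g_real_fd[SKETCH_MAX_VFDS];   /* virtual FD -> real FD */
static int g_next_vfd = 0;

/* Stand-in for the RPC to the shadow; returns a fake remote FD. */
static int sketch_remote_open(const char *path)
{
    (void)path;
    g_last_path = "rpc";
    return 100;
}

/* Stand-in for the direct system call; returns a fake local FD. */
static int sketch_local_open(const char *path)
{
    (void)path;
    g_last_path = "syscall";
    return 3;
}

/* Stand-in for the virtual file table: reserves a virtual FD and, if the
 * file is meant to be local (assumed here: anything under /tmp/),
 * switches the mode to local. */
static int sketch_table_reserve(const char *path)
{
    assert(g_next_vfd < SKETCH_MAX_VFDS);
    if (strncmp(path, "/tmp/", 5) == 0) {
        g_locality = SKETCH_LOCAL;
    }
    return g_next_vfd++;
}

/* The switch: the mapped path consults the table, flips to unmapped, and
 * re-enters itself; the unmapped path does the RPC or the system call. */
int sketch_open(const char *path)
{
    if (g_mapping == SKETCH_MAPPED) {
        enum sketch_mapping  saved_mapping  = g_mapping;
        enum sketch_locality saved_locality = g_locality;
        int vfd = sketch_table_reserve(path);
        int real_fd;

        g_mapping = SKETCH_UNMAPPED;
        real_fd = sketch_open(path);      /* takes an unmapped path */

        /* Undo the mode changes as the call stack unwinds. */
        g_mapping  = saved_mapping;
        g_locality = saved_locality;

        g_real_fd[vfd] = real_fd;
        return vfd;                       /* the application sees only this */
    }
    if (g_locality == SKETCH_REMOTE) {
        return sketch_remote_open(path);
    }
    return sketch_local_open(path);
}
```

In this sketch, sketch_open("/home/user/input.dat") takes the mapped-remote then unmapped-remote path and returns virtual FD 0, while a later sketch_open("/tmp/scratch.dat") ends in the unmapped-local path and returns virtual FD 1; both leave the global mode as they found it.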

In stand-alone mode, libsyscall defaults to mapped-local instead of mapped-remote, but otherwise functions identically.

The remote system calls proper are implemented as 'senders' and 'receivers', which are almost entirely automatically-generated. Senders exist in the starter and libsyscall; the receivers exist only in the shadow.

Some switches cannot be automatically generated, and are known as "special" switches. For Linux, there are quite a few of these: most handle the [x|l|f]stat family and other calls whose kernel-space and user-space data structures differ; others cover functions normally implemented as macros; and some are special because they don't behave exactly the same as the normal system call. (For example, gettimeofday() caches the offset between the local clock and the shadow's clock, because many applications use gettimeofday() to self-profile their inner loops. Likewise, isatty() should always return false when running under HTCondor, but this can make some Fortran run-times behave oddly.) There are very few "special" senders or receivers. Normal switches, senders, and receivers are generated from a '.tmpl' file by /stubgen/.
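The gettimeofday() example can be illustrated with a small sketch. It is an assumption-laden illustration rather than the real special switch: sketch_remote_gettimeofday() is a hypothetical stand-in for the RPC to the shadow (it simply pretends the shadow's clock runs five seconds ahead), and all other names are invented here.

```c
#include <stddef.h>
#include <sys/time.h>

static int g_remote_calls = 0;        /* counts simulated RPCs */
static int g_have_offset = 0;
static long long g_offset_usec = 0;   /* shadow clock minus local clock */

static long long to_usec(const struct timeval *tv)
{
    return (long long)tv->tv_sec * 1000000LL + (long long)tv->tv_usec;
}

/* Hypothetical stand-in for the RPC: pretend the shadow's clock is
 * exactly five seconds ahead of the local one. */
static int sketch_remote_gettimeofday(struct timeval *tv)
{
    g_remote_calls++;
    if (gettimeofday(tv, NULL) != 0) {
        return -1;
    }
    tv->tv_sec += 5;
    return 0;
}

/* Special switch: pay for one round trip, then answer every later call
 * from the local clock plus the cached offset. */
int sketch_gettimeofday(struct timeval *tv)
{
    struct timeval local;
    long long shifted;

    if (gettimeofday(&local, NULL) != 0) {
        return -1;
    }
    if (!g_have_offset) {
        struct timeval remote;
        if (sketch_remote_gettimeofday(&remote) != 0) {
            return -1;
        }
        g_offset_usec = to_usec(&remote) - to_usec(&local);
        g_have_offset = 1;
    }
    shifted = to_usec(&local) + g_offset_usec;
    tv->tv_sec  = (time_t)(shifted / 1000000LL);
    tv->tv_usec = (long)(shifted % 1000000LL);
    return 0;
}
```

However many times an inner loop calls sketch_gettimeofday(), only the first call touches the simulated shadow, which is the point of caching the offset.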

The new cmake dependencies are all screwed up, so when mucking about with this code, it's wise to make clean to ensure that all your changes are propagated everywhere.

It's sometimes necessary to call glibc functions from the switches (rather than make a system call directly). /stubgen/ supports this as a primitive, and will do horrible things with nm to extract (and rename) the libc function(s) as appropriate.

A 'remap' defines a bunch of BS function names that glibc has a (weak) symbol for to make sure the proper switch is called. (That is, all glibc entry points for the same function must call the same libsyscall function.) By having a 'real function' of the appropriate name that just wraps the switch, we ensure that the linker does what we want.

The remote I/O system also implements 'pseudo' system calls, which are handled entirely by the shadow. This includes the suspend and resume operations.

Remote I/O RPC #s are defined in syscall_numbers.h and have nothing to do with what anybody else uses for system call numbers. do_remote_syscall() is a big switch that calls the appropriate receiver.
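The "big switch" shape of the dispatcher looks roughly like the sketch below. The RPC numbers, receiver names, and return values are all hypothetical; the real numbers live in syscall_numbers.h and the real receivers unmarshal their arguments from the socket.

```c
/* Hypothetical RPC numbers for illustration only. */
enum sketch_rpc {
    SKETCH_RPC_OPEN    = 1,
    SKETCH_RPC_CLOSE   = 2,
    SKETCH_RPC_SUSPEND = 3   /* a 'pseudo' call, handled entirely by the shadow */
};

static int g_open_calls = 0;
static int g_close_calls = 0;
static int g_suspend_calls = 0;

/* Hypothetical receivers; real ones would read arguments off the wire,
 * perform the call on the submit machine, and send back the result. */
static int sketch_receive_open(void)    { g_open_calls++;    return 0; }
static int sketch_receive_close(void)   { g_close_calls++;   return 0; }
static int sketch_receive_suspend(void) { g_suspend_calls++; return 0; }

/* Dispatch one request to its receiver; unknown numbers are rejected. */
int sketch_do_remote_syscall(int rpc_number)
{
    switch (rpc_number) {
    case SKETCH_RPC_OPEN:
        return sketch_receive_open();
    case SKETCH_RPC_CLOSE:
        return sketch_receive_close();
    case SKETCH_RPC_SUSPEND:
        return sketch_receive_suspend();
    default:
        return -1;
    }
}
```

Because the numbering is private to the sender and receiver, the only requirement is that both sides are built from the same header.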

We turn off checkpointing when sockets or pipes are open, and sockets are opened by the application, not the shadow.

Arguments are marshalled via the overloaded function sock->code(), implemented in condor_io/stream.cpp; this code should generally not need to be changed.

If you start a standard universe job with -_condor-aggravate-bugs, the virtual file table will deliberately avoid identity mappings. This can expose bugs in the interpositioning layer, as invalid FDs become more likely.
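To see why avoiding identity mappings helps, consider this minimal sketch of a virtual file table. The names (vft_init(), vft_add(), vft_lookup()) and the lowest-free-slot policy are assumptions made for illustration; the real table tracks far more per-file state.

```c
#include <assert.h>

#define VFT_SIZE 32

static int vft_real[VFT_SIZE];   /* virtual FD -> real FD, -1 if free */
static int vft_aggravate = 0;    /* nonzero: never map an FD to itself */

void vft_init(int aggravate)
{
    int i;
    for (i = 0; i < VFT_SIZE; i++) {
        vft_real[i] = -1;
    }
    vft_aggravate = aggravate;
}

/* Hand out the lowest free virtual FD for real_fd. In aggravate mode,
 * skip the slot whose index equals real_fd, so that code which forgets
 * to translate an FD fails instead of working by coincidence. */
int vft_add(int real_fd)
{
    int vfd;
    assert(real_fd >= 0);
    for (vfd = 0; vfd < VFT_SIZE; vfd++) {
        if (vft_real[vfd] != -1) {
            continue;
        }
        if (vft_aggravate && vfd == real_fd) {
            continue;
        }
        vft_real[vfd] = real_fd;
        return vfd;
    }
    return -1;   /* table full */
}

/* Translate a virtual FD back to the real one, or -1 if unknown. */
int vft_lookup(int vfd)
{
    if (vfd < 0 || vfd >= VFT_SIZE) {
        return -1;
    }
    return vft_real[vfd];
}
```

With the flag off, real FD 0 lands in virtual slot 0 and a missing translation goes unnoticed; with it on, the same FD lands in slot 1, so passing the untranslated value to a system call hits the wrong (or an invalid) descriptor.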

The virtual file table also checks if a file is both read and written by an application, and will warn if one is, as this can cause inconsistencies across a restart.

There is also a libzsyscall.a which is directly analogous to libsyscall.a, except it performs compression of the checkpoints as they are read/written.

Checkpointing

One of the things in libsyscall is a replacement for crt0.o, which has a special main() function. This function configures the signal handlers for SIGTSTP and SIGUSR2 (which checkpoint & stop, or checkpoint & resume). It also determines if it's a restart or not. Because we're overriding main(), global constructors can be called multiple times per run, but that's probably not important.
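A minimal sketch of the signal-handler setup follows. It assumes the two signals pair with the two actions in the order the text lists them (SIGTSTP for checkpoint & stop, SIGUSR2 for checkpoint & resume); the handler only records the request, and all names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes sigaction() in strict ISO C mode */
#endif
#include <signal.h>
#include <string.h>

enum { SKETCH_NONE = 0, SKETCH_CKPT_AND_STOP = 1, SKETCH_CKPT_AND_RESUME = 2 };

static volatile sig_atomic_t g_ckpt_requests = 0;
static volatile sig_atomic_t g_last_action = SKETCH_NONE;

/* The real handler would run the checkpoint here; this one only notes
 * which action was requested (assumed signal-to-action pairing). */
static void sketch_on_checkpoint_signal(int signo)
{
    g_ckpt_requests++;
    g_last_action = (signo == SIGTSTP) ? SKETCH_CKPT_AND_STOP
                                       : SKETCH_CKPT_AND_RESUME;
}

/* Called from the replacement startup code before the application's own
 * main() runs. Returns 0 on success, -1 on failure. */
int sketch_install_checkpoint_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sketch_on_checkpoint_signal;
    sigfillset(&sa.sa_mask);   /* block other signals while checkpointing */

    if (sigaction(SIGTSTP, &sa, NULL) != 0) {
        return -1;
    }
    if (sigaction(SIGUSR2, &sa, NULL) != 0) {
        return -1;
    }
    return 0;
}
```

Blocking all signals in sa_mask is a design choice of this sketch: it keeps a second checkpoint request from interrupting one already in progress.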

The basic checkpointing routine is to:

  1. Call setjmp().
  2a. If this is a return from longjmp(), clean up and exit the signal handler.
  2b. Find the segment boundaries (using machdep).
  3. Open FD to checkpoint server or file.
  4. Write pages.
  5. Either exit or call longjmp().

To restart, read the pages back into memory (sbrk()ing as necessary) and then longjmp() to the stored jump buffer. The OS will take care of restoring application state from the jump buffer and the return out of the signal handler.
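The setjmp()/longjmp() control flow of the routine above can be sketched as follows. This shows only the "checkpoint & resume" shape within a single process: there is no signal handler, no real page writing, and no restart in a fresh process, and sketch_write_pages() is a hypothetical placeholder for steps 2b through 4.

```c
#include <setjmp.h>

enum { SKETCH_RESUMED = 1 };

static jmp_buf g_ckpt_env;        /* the stored jump buffer */
static int g_pages_written = 0;

/* Placeholder for finding the segment boundaries, opening the
 * checkpoint destination, and writing the pages. */
static void sketch_write_pages(void)
{
    g_pages_written++;
}

/* Returns SKETCH_RESUMED once control comes back via longjmp(). */
int sketch_checkpoint(void)
{
    /* Step 1: record where to come back to. */
    if (setjmp(g_ckpt_env) != 0) {
        /* Step 2a: we got here from longjmp(); clean up and return,
         * as the real code would return out of the signal handler. */
        return SKETCH_RESUMED;
    }

    /* Steps 2b-4: take the checkpoint. */
    sketch_write_pages();

    /* Step 5: resume by jumping back to the saved context. The real
     * code may exit here instead; a restart reloads the pages and then
     * performs this same longjmp() from the new process. */
    longjmp(g_ckpt_env, 1);
}
```

The same jump buffer serves both cases, which is why the restart path needs nothing beyond restoring memory and calling longjmp().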

Checkpoint Server

The checkpoint server is grad-student code, and the '2' files are the important ones. It's not daemon core and has no security whatsoever. It listens on a few canonical sockets. The shadow gets a token (capability) from the checkpoint server and gives it to the starter/job, which can then use that token to fetch its checkpoint image. The schedd can also talk to the checkpoint server, but only to remove a checkpoint image when the job is removed. The shadow generally manages the checkpoint server for the standard universe.

Debugging

No valgrind/purify/memcheck will work, due to obvious abuses.

gdb gets confused in the checkpoint routines.

dprintf() in the checkpoint library is not the same dprintf() as the rest of HTCondor.

-_condor_D_[CKPT|ALWAYS|FULLDEBUG] are your friends when executing a job.

FIXME

(16:29:56) Pete Keller: Compression of checkpoints is handled by using an alternate memory heap that is allocated by a raw call to syscall(SYS_mmap, ...). The alternate heap is not checkpointed, restored, or book kept.
(16:31:41) Pete Keller: When restarting a checkpointed process, we move to an alternate stack defined in the bzero segment so we can restore the STACK segment properly and still be able to have a stack of our own to finish the work.
(16:32:58) Pete Keller: Also, when resuming, we resume, via the longjmp() INTO the signal handler of the previous control flow, then return from the signal handler back to the regular code.
(16:34:36) Pete Keller: A caveat: Glibc encrypts the stack pointer in the jmpbuf structure. The macros PTR_ENCRYPT and PTR_DECRYPT() in machdep.h deal with the nastiness.
(16:34:50) Pete Keller: I forgot to mention why the test programs are what they are. :(
(16:35:28) Pete Keller: Most of them are commented as to what they test. the job_rsc-all-syscalls_std.c test is a VERY IMPORTANT one that does a unit test of all remote i/o or other interposed libc calls. If anything fails in there, it is very bad.
(16:35:45) Pete Keller: The job_rsc_* and job_ckpt_* tests are for stduniv.
(16:37:00) Pete Keller: The "sanity" ckpt test seemingly doesn't test anything. However, that program alone is responsible for finding a huge number of bugs. It just happens to tickle a lot of subsystems in glibc, even though it is fucking simple. Don't ever remove that one from the test suite even if you don't know what it does. :)
(16:37:15) Pete Keller: All tests are there for a reason at one time or another. :)
(16:37:54) Pete Keller: Most of what the tests test are obvious and decently commented. A few are redundant, but that's ok.
(16:38:38) Pete Keller: A big thing to realize in the testing of stduniv, is that it is only reliable to a statistical confidence value. running a test 10 times and seeing it succeed doesn't mean squat. You must see it run 10 million times and in changing environments.
(16:39:33) Pete Keller: stduniv is nearly undebuggable, the only means to truly know if it works is to determine if the test program did the correct thing (ALWAYS via external verification to the test program) and then run it in the millions of times to ensure you don't shake out crazy little segfaulting bugs.
(16:40:17) Pete Keller: The current set of tests do a pretty good job of it, though.