HTCondorWiki: Procd Wisdom

Page History

(By Greg Quinn)

This page will serve as my brain dump of all things ProcD. My goal is to highlight things that aren't obvious from looking at the code. In addition, I'm neither dying nor moving farther than 0.2 miles from the Condor project's home in the CS building so please feel free to let me know if something is glaringly missing from this page.

High-Level Summary of ProcD's Operation

When the ProcD starts up, it looks up its parent PID and begins monitoring that process and any child processes. Child processes are discovered by polling the system. The interval at which the ProcD polls the system can be specified via the the command line (actually this is the maximum interval; see below about registering subfamilies).

For a family that the ProcD is monitoring, clients can request the following actions:

Get resource usage totals for the processes in the family
Send a signal to the family's "root" process
Suspend / continue all the family's processes
Kill all the family's processes

The ProcD supports nested families (or subfamilies). For example, the Master may start a ProcD which then begins monitoring the Master's family. When the Master starts a daemon like the StartD, it will want to be able to act on the StartD and all it's children in the ways listed above. So it registers a subfamily with the ProcD. The StartD will similarly register a nested family when it spawns the Starter, as will the Starter when it spawns the job.

After registering a new subfamily, a client of the ProcD can tell the ProcD what "tracking methods" it should use in discovering what processes belong to the family. Methods include things like using environment variables, UIDs, or GIDs. Without specifying any additional methods, only PPIDs will be used.

Relationship to PrivSep

The ProcD was born out of the PrivSep project. The idea was that while most root-enabled things could be accomplished with a stateless root-owned setuid binary (the condor_root_switchboard), our process family tracking code needed a root-running daemon. In hindsight, I think this may have been a mistake (see ticket #110). Even without the ProcD's supposed benefit for PrivSep it is still a big win for scalability, particularly as we move toward more and more cores. Previously, Condor was executing the process tracking code in each Starter in addition to in the StartD and Master. In addition, the ProcD has some other performance advantages over the older process tracking code, like using a hash table when snapshotting to "remember" whether processes that have been seen before as in families we are interested in or not.

There are some important consequences of the ProcD's PrivSep heritage. The fact that the ProcD is included in the "trusted" portion of the PrivSep architecture inspired us to make it include as little code as possible. As such, it does not use dprintf logging, doesn't use Condor's configuration subsystem, and doesn't (currently) use some handy features like the Google cored dumper. Logging is handled in a very basic way and is missing crucial features that dprintf has like rotation. As a result, the ProcD's log file is disabled by default. Configuration is handled via the command line. A Condor daemon starting the ProcD will read its configuration parameters out of the Condor config file then place those on the ProcD's command line.