HTCondorWiki: Creating New Daemons

Creating a New Daemon

From: Erik Paulson
Subject: how to write a new HTCondor daemon

[This sort of happened by accident, and it's not yet complete. Please give me feedback, I hope that it can someday be the start of getting started with the HTCondor codebase document. Also, it's nearly 3AM and I've decided that I dont want to proof-read the thing]

So you want to write a new HTCondor daemon. Well, it happens, and it occurred to me that we should have some documentation on how you do that. And when I say documentation, I don't mean some doc++ in the daemon core headers, but a nice narrative that stops and explains what's going on with things from time to time. This started out as note to Ben on how I wanted him to get started on stuff, but it may as well go to everyone.

Well, let's pretend that your new daemon is going to be called the condor_ca_server. We want our daemon to run on Windows and UNIX, and we want it to be able to use strong authentication and encryption over the network.

The first stop in writing a new daemon in HTCondor is the directory condor_dcskel. This is the bare minimum that you need to have in a Daemon-core based program to make it work.

Here's the first example of that explanation thing I was talking about earlier - what the hell is a daemon-core based program? Well, HTCondor Daemon Core is just what the name suggests - it's a library that gives you the core of a UNIX or NT server process. It gives you a platform-independent way to create processes and react to the termination of those processes, set timers, handle signals, and be able to easily handle client "commands" with just a few functions. Daemon core is goodness and light, and the parts of HTCondor that don't use daemon core are unadultered evil, like the checkpoint server.

One thing that daemon core is not is threaded - DaemonCore is an example of event-driven execution. At it's heart, daemon core is just a big loop that calls select(2) over and over again. (select is a Unix system call that takes a list of file descriptors and a timeout, and blocks until something "interesting" happens, or the timeout occurs. It's the second-coolest system call ever, after mmap(2)). This event-driven model is a bit frustrating for developers, because most people tend to think procedurally, and it's easier for people understand that what they're doing might be interrupted by the computer, but when they start back up they'll be right where they left off. It's not that hard to work in an event-driven model - the key to remember is when you're going to do something that is going to block, instead of relying on the threading package to keep things going for you you just have to register a callback, and you'll logically pick up right where you left off again, just in a new function. The mantra is: get back to the select() as quick as you can - whenever you're not in select, nothing else can happen.

Daemon Core is also not light weight - all this goodness comes at a price, and HTCondor's libraries depend on the fact that everything is there. This has led a lot of the "research" code to avoid using the HTCondor libraries, because the HTCondor libraries are designed to write HTCondor daemons and tools. (For example, NeST is implemented with an entirely fresh code base. In some ways, this sucks, because the NeST developers have to re-learn what took Todd, Derek, Jim, Rajesh, Pete, and Mike Litzkow many years to learn. But at the same time, NeST is threaded.) If you're reading this because you want to write a HTCondor daemon, then the question is already answered for you: you're using daemon core.

Now, back to the condor_dcskel directory: there are two files in there, an Imakefile and a dcskel_main.C. This is enough to get you started. The first step is to make a new directory to put your code in. Since our example is the Mini-CA, we'll call our directory condor_ca. Then we'll copy the Imakefile and dcskel_main.C into that directory, and rename dcskel_main.C to ca_server_main.C

One quick diversion - we'll need to make sure that our code gets built by the top-level Imakefile. In our case, this is pretty easy - we only have to change two lines in src/Imakefile.

Track down the #define Programs line, and add your directory name (without the condor_ prefix) and add your directory to the end of the line.
Then, go find the object_target declarations, and add your directory again.

Now, back into the condor_ca directory - let's remove all the references to the dcskel and make ourselves a real program. First, start out in the Imakefile, and wherever you see something with dcskel get rid of it. Here's what the first Imakefile for the mini CA should look like:

__START__

NAME = condor_ca_server
all_target( $(NAME) )

CFLAGS = $(STD_C_FLAGS)
C_PLUS_FLAGS = $(STD_C_PLUS_FLAGS)
LIB = $(DAEMONCORE_LIB) $(STD_LIBS)

OBJ = ca_server_main.o

c_plus_target($(NAME),$(OBJ),$(LIB))

html:
release:: all
stripped:: all
static:: all


__END__

Nothing too complex - this says we're going to wind up with a condor_ca_server, the objects to go in it is ca_server_main.o, and to build it like any other C++ program.

Now let's edit ca_server_main.C:

The first thing to change, of course, is the mySubSystem variable - this gets used all over the place in debugging - Daemon Core will try and log to $mySubSystem_LOG, and dprintf uses $mySubSystem_DEBUG to decide what debug level to use, and so on.

For a basic daemon core program, that's it! Everything will just work now.

Of course, you don't believe me, and so you investigate a bit. (How could you possibly be done - you didn't even write a main()...) So you read the bit of code, and actually, you'll notice that you don't see a main() in there. That's another bit about daemon_core - if you use it, you don't actually get to control main() anymore - that's way off in the library somewhere. Daemon core is going to call a couple functions that you need to provide, but otherwise it's sticks to itself. The 4 functions it's going to call are:

main_init -- it gets called at startup.
main_config -- it gets called whenever daemon_core wants you to reconfigure
             yourself (QUESTION: does main_config get called at startup
             as well, or are you supposed to call it in main_init if you care,
             and main_config only gets called on future reconfigs?)
main_shutdown_graceful - it gets called when your program needs to exit, but
             your program can takes it's sweet time in shutting things down.
             I don't actually know if there's a time limit or not, but
             there basically isn't
main_shutdown_fast - you need to turn yourself off now. Don't dally about.

So, if you compile your new daemon up and run it (with the -f -t arguments, don't forget!) you'll discover that you're the proud author of a useless program. We should probably make it do something.

Remember how daemon_core works - it's based around select(), and we want to get back to select as often as we can. So, all we need to do is convince select() to call a function we write when something interesting happens, we can have something happen. Remember when I said daemon core lets us have timers and commands and signals and such? Well, the way you use those is with a few Register_X calls - there's a Register_Command, a Register_Timer, a Register_Signal, a Register_Reaper (it gets called when a process exits), and a Register_Socket. We can register a thing we're interested in, and a function to get called, and Daemon Core will call that function when select says something interesting happened on that socket. Those functions are called handlers.

I lied a little bit when I said Daemon Core was a library - right now, we should start calling it by it's true name, daemonCore. daemonCore is a C++ object, and there is one daemonCore, and only one daemonCore in your program. daemonCore is derived from the Service class (which is defined in condor_timer_manager.h, of all places :). This Service class is actually pretty important, because everything function that we register with daemonCore that we want to be a class method as a handler, that class needs to be inherited from the Service class. For now, let's ignore C++ and just register a C function that prints hello. In the main_init class, add this line:

	daemonCore->Register_Command(12345, "SAY_HELLO",
			(=CommandHandler=)&say_hello, "say_hello", NULL, READ, D_FULLDEBUG );

The arguments are:

The command number (usually defined in condor_includes/condor_commands.h
A text description of the command
A "CommandHandler", which is really a function pointer
A text description of the handler
The service class to use. Since this is a C handler, we don't need one.
What Permission level we need to be to call this function (ie HOSTALLOW_READ, HOSTALLOW_ADMINISTRATOR, etc)
What dprintf level to use (I'm not really clear on how this works...)

Of course, we'll need to add the command handler as well, but that's pretty easy:

int
say_hello(Service *, int, Stream *sock)
{
	dprintf(D_ALWAYS, "Hello, =DaemonCore=!\n");

	return true;
}

So, now we can print how "Hello, DaemonCore" on command, but we still have a pretty useless program - how do we get the command to our daemon, so daemonCore will call our function? The answer is CEDAR.

Now, most of you have heard of CEDAR, but I'd bet a good chunk of the team doesn't know just what it is. If the question is, "what does CEDAR do", the answer is "Yes". CEDAR is our all-singing, all-dancing communications library that keeps Hao and Sonny in business full time. CEDAR stands for the HTCondor External DAta Representation library (well, maybe it does.) CEDAR started out life as a replacement for XDR, which was a way to represent data between different hosts, assuming the only thing that was portable was bytes. Sun used it in Sun RPC (so NFS uses/ used it) - thankfully, I'm too young to have ever had to use it (instead, I get XML. Oh Joy.) But CEDAR has grown far beyond that.

CEDAR, for the low low price of just linking it in, can:

Convert a large number of types between different hosts in a reliable method. For example, if a client sends a server a double, and they're totally different architecture, CEDAR will reorder the bytes if need be, and convert it into the right-size (if a long is 4 bytes on one platform, and 8 on another CEDAR gets it right.) CEDAR knows about a ton of different types, and it's Smarter-Than-I-Am - it won't let you do Wrong things, like overflow a signed integer.
CEDAR is also our platform-independent socket and communications library. CEDAR has two types of Sockets: The ReliSock, which gives you a TCP-like stream, and the SafeSock, which gives you a UDP-based message exchange. (The SafeSock name has to do with buffering- you can just give it huge packets and it gets it right, splitting them up into fragments and reassembling it on the other side)
CEDAR can authenticate for you with a whole host of authentication methods - all you have to do is say "Authenticate" to a CEDAR socket, and it can use Kerberos, X509, NT LanMan, File system, Claim-to-be, and hopefully soon a password (actually, hopefully soon a PAM module, which gets us a huge new world of stuff)
CEDAR can encrypt everything that goes over it, either with Blowfish or 3DES. This is separate from the authentication code, so you can authenticate without encrypting, or encrypting with only a shared secret ahead of time.
CEDAR can bandwidth regulate
CEDAR can limit itself to a range of ports
CEDAR (will soon enough) support a connection-broker approach, to allow for third parties to establish connections on a users behalf. We need this for firewall support, where network access is not symmetric.

Now, back to our daemon. We've registered command 12345 as the "Say hello" command. A command is really nothing more than an integer that daemonCore is watching for. When you fire up your daemon on the command line, you'll see it print out "Command socket on <128.105.45.39:22313>" or some such - that's the socket that daemonCore is watching (through the select loop) for something interesting to happen on. So, to send a command to our daemon, all we have to do is open up a socket to <128.105.45.39:22313> (this string, by the way, is the "sinful string" of the daemon, so named because of a field in a sockaddr structure).

As a digression, we didn't actually have to Register a command - that's just a nice shortcut that Daemon Core gives us. If we really wanted to, we could have opened up another socket (not the command socket), and called Register_Socket() on it. Then, as soon as something connected to that socket, daemonCore would have called our socket handler, and we could start reading things off the wire. We could then just try and read an integer off the wire, and we could compare it to 12345, and if we found a match we could print out "Hello" right there. We could also treat that socket handler as our our Register_Command routine, where we call some function depending on what we read off the network. (There are other reasons to use Register_Command, which I'll get into in a bit)

Once we have that socket, we just send the integer 12345 over, and say done, and we should get a "Hello" out of our server. It's late, and I've got a bit more to write yet tonight, so I'm not going to write this code for the example.

Part of the reason I'm not going to write the example code is that you don't want to send a command this way. For one, it's unreasonable to expect people to know the exact socket address of everything they want to talk to - could you imaging having to type a full hostname and port just to talk to the schedd? So there are better ways of finding out who you should talk to - if you want to send a command to something, you use the daemon object. The daemon object is smart - if you want to talk to a schedd, you create a daemon object and tell it is schedd, and to "locate" that schedd. The daemon object has all the smarts of how to find the daemon - it knows, for example, that to find a schedd it should to see if there's a SCHEDD_ADDRESS file, and if not then go to the collector and find out the sinful string of the schedd. The daemon object also includes a slightly different API for sending a command - instead of directly coding up the command integer and sending it, you call daemon->startCommand(12345), and it does the same thing, but also takes care of negotiating all the security stuff for you.

I don't want to write too much about this, though - the I think the daemon object got a rewrite, and the new way is condor_daemon_client. I will make Zach or Derek explain how all that works.

-Erik, writing a howto much longer and later than he expected to be.