The key to high throughput is efficient use of available resources. Years ago, the scientific community relied on large mainframe computers to do its computational work. A large number of individuals and groups would have to pool their financial resources to afford such a computer, and it was not uncommon to find just one such machine at even the largest research institutions. Scientists had to wait for their turn on the mainframe and were only allotted a certain amount of time, so they had to limit the size of their problems to make sure they would finish in the time given. While this environment was inconvenient for the users, it was very efficient, since the mainframe was busy nearly all the time.
 
As computers became smaller, faster, and cheaper, scientists moved away from mainframes and started buying personal computers. An individual or a small group could afford a computing resource that was available whenever they wanted it. It might be slower than the mainframe, but since they had exclusive access, it was worth it. Now, instead of one giant computer for a large institution, there might be hundreds of personal computers. This is an environment of distributed ownership, where individuals throughout the organization own their own resources. The total computational power of the institution as a whole might rise dramatically as a result of such a change, but the resources available to any individual user remain roughly the same. While this environment is more convenient for the users, it is also much less efficient: many machines sit idle for long periods while their users are busy doing other things (see footnote 1). Condor takes these wasted computational resources and turns them into an HTC environment.
 
To achieve the most throughput, Condor provides two important functions. First, it makes the available resources more efficient by finding idle machines and putting them to work. Second, it expands the resources available to a given user by functioning well in an environment of distributed ownership.
 
 
{subsection: What is Checkpointing?}
 
Simply put, checkpointing means saving all the work a job has done up to a given point. Normally, when performing long-running computations, if a machine crashes or must be rebooted for an administrative task, all the work that has been done is lost and the job must be restarted from scratch, which can mean days, weeks, or even months of wasted computation. With checkpointing, Condor ensures that positive progress is always made on jobs and that you only lose the computation performed since the last checkpoint. Condor can be configured to checkpoint periodically, you can issue a command to asynchronously checkpoint a job on any given machine, or you can even call a function within your code to perform a checkpoint as the job runs. Checkpointing also happens whenever a job is moved from one machine to another, which is known as "process migration" (see footnote 2).
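
To make this concrete, here is a minimal sketch of a job that requests a checkpoint from within its own code. It assumes the program is linked against Condor's checkpointing library; the entry point name ckpt() is taken from that library's standalone checkpointing interface, but treat the exact name and linkage here as an assumption rather than a guaranteed API.

    /* Minimal sketch: a long-running loop that asks for a checkpoint
     * every so often.  Assumes the program is linked against Condor's
     * checkpointing library, which supplies the checkpoint entry point;
     * the name ckpt() is an assumption, not a guaranteed interface.   */
    #include <stdio.h>

    extern void ckpt(void);   /* supplied by the Condor library (assumed) */

    int main(void)
    {
        double sum = 0.0;
        for (long i = 1; i <= 100000000L; i++) {
            sum += 1.0 / (double)i;      /* the actual computation          */
            if (i % 10000000L == 0)
                ckpt();                  /* save all progress made so far   */
        }
        printf("result: %f\n", sum);
        return 0;
    }

A periodic or commanded checkpoint works the same way; the only difference is what triggers the save.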
 
Checkpointing is accomplished by saving all the information about the state of the job to a file. This includes all the registers currently in use, a complete memory image, and information about all open file descriptors. This file, called a "checkpoint file", is written to disk. The file can be quite large, since it holds a complete image of the process's virtual memory address space. Normally, the checkpoint file is returned to the machine the job was submitted from. Alternatively, a "checkpoint server", a single machine where all checkpoints are stored, can be installed in a Condor pool. An administrator can set up a machine with plenty of disk space as the checkpoint server, so that the individual machines in the pool do not need any additional disk space to hold the checkpoints of the jobs they submit.
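
The exact on-disk layout is internal to Condor, but conceptually the checkpoint file has to capture the pieces listed above. The structure below is purely illustrative; the field names and layout are invented for this sketch and do not describe Condor's real format.

    /* Illustrative only: the kind of information a checkpoint file must
     * capture.  Field names and layout are invented for this sketch and
     * do not reflect Condor's actual on-disk format.                    */
    #include <stddef.h>
    #include <sys/types.h>

    struct saved_fd {                  /* one entry per open descriptor    */
        int   fd;
        char  path[4096];
        off_t offset;                  /* file position to restore         */
        int   flags;                   /* open() flags to restore          */
    };

    struct checkpoint_header {
        unsigned long registers[32];   /* CPU register contents            */
        size_t        mem_size;        /* size of the memory image that    */
                                       /* follows the header on disk       */
        int           fd_count;        /* number of saved_fd records after */
                                       /* the memory image                 */
    };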
 
 
Every machine in the pool has certain properties: its architecture, operating system, amount of memory, CPU speed, amount of free swap and disk space, and other characteristics. Similarly, every job has certain requirements and preferences. A job must run on a machine with the same architecture and operating system it was compiled for. Beyond that, jobs might have requirements for how much memory they need to run efficiently, how much swap space they will need, and so on. Preferences are characteristics the job owner would like the executing machine to have but which are not absolutely necessary; if no machine matching the preferences is available, the job will still run on another machine. The owner of a job specifies its requirements and preferences when the job is submitted. The properties of the computing resources are reported to the central manager by the startd on each machine in the pool. The negotiator's task is not only to find idle machines, but to find machines whose properties match the requirements of the jobs and, if possible, their preferences.
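
The sketch below captures the matchmaking idea in a few lines of C: a machine is only usable if it satisfies every hard requirement, and among usable machines the negotiator prefers one that also satisfies the job's preferences. This is a simplified illustration, not Condor's implementation; the structures and fields are invented for the example.

    /* Simplified matchmaking: not Condor's implementation, just the idea.
     * A machine must meet every requirement; preferences only break ties. */
    #include <string.h>

    struct machine {
        const char *arch, *opsys;
        long mem_mb;
        int  idle;                          /* 1 if the owner is away      */
    };

    struct job {
        const char *arch, *opsys;           /* hard requirements           */
        long min_mem_mb;                    /* hard requirement            */
        long preferred_mem_mb;              /* preference only             */
    };

    static int meets_requirements(const struct machine *m, const struct job *j)
    {
        return m->idle &&
               strcmp(m->arch, j->arch) == 0 &&
               strcmp(m->opsys, j->opsys) == 0 &&
               m->mem_mb >= j->min_mem_mb;
    }

    /* Return the index of the best machine for the job, or -1 if none match. */
    static int negotiate(const struct machine *pool, int n, const struct job *j)
    {
        int best = -1, best_score = -1;
        for (int i = 0; i < n; i++) {
            if (!meets_requirements(&pool[i], j))
                continue;
            int score = (pool[i].mem_mb >= j->preferred_mem_mb);   /* 0 or 1 */
            if (score > best_score) {
                best = i;
                best_score = score;
            }
        }
        return best;
    }

In the real system the machine properties come from the startd's reports and the requirements and preferences come from the job's submission, but the decision has the same shape: requirements filter, preferences rank.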
 
When a match is made between a job and a machine, the central manager sends a message to the Condor daemons on each machine. The schedd on the submitting machine starts up another daemon, called the "shadow". This acts as the connection to the submitting machine for the remote job: it is the shadow of the remote job on the local submitting machine. The startd on the executing machine also creates another daemon, the "starter". The starter actually starts the Condor job, which involves transferring the binary from the submitting machine (see figure 2). The starter is also responsible for monitoring the job, maintaining statistics about it, making sure there is space for the checkpoint file, and sending the checkpoint file back to the submitting machine (or to the checkpoint server, if one exists). In the event that a machine is reclaimed by its owner, it is the starter that vacates the job from that machine (see footnote 3).
 
 
 Figure 2: {image:img2.gif}
 
 {subsection: How Do Remote System Calls Work?}
 
While running on the executing machine, nearly every system call a job performs is caught by Condor. This is done by linking the job against the Condor library instead of the standard C library. The Condor library contains function stubs for all the system calls, much as the C library contains function wrappers for them. These stubs send a message back to the shadow, asking it to perform the requested system call. The shadow executes the system call on the submitting machine, takes the result, and sends it back to the executing machine. The system call stub inside the Condor executable gets the result and returns control to the job (see footnote 4 and figure 3). From the job's point of view, it made a system call, waited for the system to give it an answer, and continued computation. The job has no idea that the system that performed the call was actually the submitting machine, not the machine where it is running. In this way, all I/O the job performs is done on the submitting machine, not the executing machine. This is the key to Condor's power in overcoming the problems of distributed ownership. Condor users only have access to the filesystem on the machine that jobs are submitted from. Jobs cannot access the filesystem on the machine where they execute, because any system calls made to access the filesystem are simply sent back to the submitting machine and executed there.
 
 
 Figure 3: {image:fig3.gif}
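
The sketch below shows the general shape of such a stub, under some stated assumptions: a fixed-size request/reply message, a socket to the shadow called shadow_fd, and a stub named remote_read(), all of which are invented for the illustration. Condor's real stubs take the place of the C library entry points themselves and speak Condor's own protocol to the shadow.

    /* Simplified sketch of a remote system call stub.  The message layout,
     * the remote_read() name, and the shadow_fd socket are invented for
     * this illustration; Condor's real stubs replace the C library entry
     * points and use Condor's own protocol.                              */
    #include <stdint.h>
    #include <unistd.h>

    static int shadow_fd = -1;        /* socket to the shadow; assumed to be
                                         connected by startup code          */

    struct syscall_request {
        int32_t syscall_no;           /* which call to perform              */
        int32_t fd;
        int64_t count;
    };

    struct syscall_reply {
        int64_t result;               /* return value of the call           */
        int32_t err;                  /* errno on the submitting machine    */
    };

    /* "read" as the job sees it: ship the request to the shadow, which does
     * the real read() on the submitting machine and sends back the result
     * (error handling omitted for brevity).                                */
    ssize_t remote_read(int fd, void *buf, size_t count)
    {
        struct syscall_request req = { 3 /* read */, fd, (int64_t)count };
        struct syscall_reply   rep;

        write(shadow_fd, &req, sizeof req);            /* ask the shadow      */
        read(shadow_fd, &rep, sizeof rep);             /* wait for the answer */
        if (rep.result > 0)
            read(shadow_fd, buf, (size_t)rep.result);  /* then the data       */
        return (ssize_t)rep.result;
    }
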
Because of the increasing shift toward personal computers, more and more computing environments are becoming fragmented and distributively owned, and more and more computing power is being wasted. Condor harnesses this power and turns it into an effective High Throughput Computing environment. By utilizing remote system calls, the Condor system provides uniform access to resources in a distributively owned environment. All system calls performed by the job are executed on the machine where the job was submitted, so for the entire life of the job, regardless of where it is actually running, it has access to the local filesystem of its owner's machine. In this way, Condor can pool together resources and make them available to a much larger community. By expanding the resources available to users at any given time, more computing throughput is achieved, and for many scientists today that throughput determines the scale and quality of the research they can do.

{subsection: References}
 
1. Livny, M. and Mutka, M. W., ``The Available Capacity of a Privately Owned Workstation Environment,'' Performance Evaluation, vol. 12, no. 4, pp. 269-284, July 1991.
 
-   2. "Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System" Dr Dobbs Journal, February 1995
+2. "Checkpointing and Migration of UNIX Processes in the Condor Distributed Processing System" Dr Dobbs Journal, February 1995
 
3. Litzkow, M., Livny, M., and Mutka, M. W., ``Condor - A Hunter of Idle Workstations,'' Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 104-111, June 1988.
 
4. Litzkow, M., ``Remote Unix - Turning Idle Workstations into Cycle Servers,'' Proceedings of the Usenix Summer Conference, 1987.