
High Frequency Computing Design and Progress

This page documents the design and progress of the High Frequency Computing project.


Overview

Justification

Condor is very good at distributing and managing large jobs that may take hours or more to complete. Managing these jobs includes moving job binaries and input files to an execute machine, executing the job, and returning any output to the submit machine. For such jobs, the overhead of staging the execute machine with binaries and input, and of returning the output, is trivial compared to the time the job takes to run. There are, however, situations in which a job takes only a few seconds or less, must be run hundreds or thousands of times with varying input data, and should complete in as little total time as possible. In that case, the overhead of moving the job binary for every run is no longer trivial. To aid in the execution of these jobs, we will add the concept of High Frequency Computing to Condor.

High Frequency Computing

At its core, High Frequency Computing (HFC) is an attempt to run as many jobs as possible in as short a time as possible. These jobs are expected to take a very short time to complete (a few seconds or less) and to work on relatively small input data (a couple hundred to a couple thousand bytes). These special jobs will be called Tasks. For an HFC job, the user will provide a Task or batch of Tasks to be submitted to Condor. Once the Tasks have been completed, the results will be returned to the user.

A Task is defined as the data required to complete the work; in this way it differs from a Condor job, since a task by itself isn't capable of doing anything. A task must be given to a worker that knows how to interpret it and produce a result from it. Workers can be thought of as task servers: send a worker a task and it will send the result back. The transfer of the worker (the executable) is therefore done only once. Once the worker is established, it will be able to process any tasks sent to it. By removing the need to transfer binaries and input files for each task, the time to complete a task is greatly reduced.
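
To make the worker model concrete, here is a minimal sketch of a worker loop in Python. The wire format assumed here (one task payload per line on stdin, one result per line on stdout) and the process() helper are illustrative assumptions only; this design does not yet fix the actual transport between Condor and the worker.

    #!/usr/bin/env python
    # Hypothetical worker loop: the worker stays resident, reading one
    # task payload per line on stdin and writing a result to stdout.
    import sys

    def process(task_data):
        # Application-specific work; here we simply echo the payload.
        return "processed:" + task_data

    for line in sys.stdin:
        task = line.strip()
        if not task:
            continue
        # Because the worker persists, no binary or input file staging
        # happens per task; only the small task payload moves.
        sys.stdout.write(process(task) + "\n")
        sys.stdout.flush()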

The user will be provided a method of staging workers (most likely via condor_submit) and submitting tasks. Condor will perform the majority of the work required to manage the tasks, including all scheduling duties. Task results will be processed inside a user-created Result Processor, which will be fed the results as they become available.
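
As a sketch of what staging might look like if ordinary condor_submit is used, a batch of workers could be started with a standard submit description file like the one below. The file is illustrative only; the exact HFC staging interface has not been settled by this design.

    # worker.sub -- hypothetical submit file staging 10 resident workers
    universe   = vanilla
    executable = worker
    output     = worker.$(Process).out
    error      = worker.$(Process).err
    log        = worker.log
    queue 10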

The user will be responsible for providing the worker executable and the result processing executable. All other duties will be performed by Condor.
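
A Result Processor might look like the following minimal Python sketch. The assumption that Condor delivers results one per line on stdin is made up for illustration; the actual delivery mechanism is a design decision still to be made.

    #!/usr/bin/env python
    # Hypothetical Result Processor: consumes results as Condor
    # delivers them, assumed here to arrive one per line on stdin.
    import sys

    for line in sys.stdin:
        result = line.strip()
        if result:
            # Application-specific handling, e.g. aggregation or logging.
            sys.stderr.write("got result: %s\n" % result)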

Summary

In general, the following summarizes the usage of the HFC system:

  1. The user provides a worker executable and a Result Processor executable.
  2. The user stages workers on execute machines (most likely via condor_submit).
  3. The user submits tasks, which Condor schedules and delivers to the staged workers.
  4. Each worker processes the tasks it receives and produces results.
  5. Condor feeds the results to the Result Processor as they become available.


Requirements

Requirements for this project are managed at HfcRequirements.


Design


Progress

  1. Benchmarking task delay #1096 - Done
  2. Adding gahp server to shadow #1244


Glossary

HFC - Stands for High Frequency Computing.

High Frequency Computing - distributed computing of many small tasks.

Result - Produced by the staged worker after the worker has finished processing the Task.

Result Processor - A user-generated executable that will be fed the results of submitted tasks.

Task - a job in the HFC context. A task is the data required for a staged worker to produce the needed output. Tasks are encapsulated in ClassAds for transport.

Worker - a binary that is submitted as a Vanilla Universe job. The worker runs continuously, waiting for tasks to be sent to it. Once a task is received, the worker processes it and produces a result.