HTCondorWiki: How To Manage Gpus

* WORK IN PROGRESS *

Condor can help manage GPUs (graphics processing units) in your pool of execute nodes, making them available to jobs that use them through an API such as OpenCL or CUDA.

Condor matches execute nodes (described by ClassAds) to jobs (also described by ClassAds). The general technique to manage GPUs is:

1. Advertise the GPU: Configure Condor so that execute nodes include information about available GPUs in their ClassAd.
2. Require a GPU: Jobs modify their Requirements to require a suitable GPU.
3. Identify the GPU: Jobs modify their arguments or environment to learn which GPU they may use.
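The job-side steps (require a GPU, then identify it) can be sketched as a small Python helper that writes a submit description. This is only an illustration: the attribute names HAS_GPU and GPU_DEV follow the static configuration example on this page, while the executable name my_gpu_job and its --device flag are hypothetical. It relies on the submit file's $$(ATTRIBUTE) substitution, which is filled in from the matched machine's ClassAd.

```python
def gpu_submit_description(executable, flag_attr="HAS_GPU", device_attr="GPU_DEV"):
    """Return submit-description text for a job that needs one GPU.

    flag_attr and device_attr are the machine ClassAd attributes the
    execute node is assumed to advertise (names taken from this page).
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Require a GPU: only match slots that advertise one.
        f"requirements = ({flag_attr} =?= True)",
        # Identify the GPU: $$(ATTR) is replaced at match time with the
        # value from the matched machine ClassAd.
        # "--device" is a hypothetical flag of the hypothetical job.
        f"arguments = --device $$({device_attr})",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(gpu_submit_description("my_gpu_job"), end="")
```

You could pass the device through the environment instead of the arguments; which is more convenient depends on how your job expects to be told.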

Advertising the GPU

A key challenge of advertising GPUs is that a GPU can only be used by one job at a time. If an execute node has multiple slots (a likely case!), you'll want to limit each GPU to only being advertised to a single slot.

You have several options for advertising your GPUs. In increasing order of complexity they are:

Static configuration
Automatic configuration
Dynamic advertising

This progression may be a useful way to do initial setup and testing. Start with a static configuration to ensure everything works. Move to an automatic configuration to develop and test partial automation. Finally a few small changes should make it possible to turn your automatic configuration into dynamic advertising.

Static configuration

If you have a small number of nodes, or perhaps a large number of identical nodes, you can add static attributes manually using STARTD_ATTRS on a per slot basis. In the simplest case, it might just be:

SLOT1_HAS_GPU=TRUE
SLOT1_GPU_DEV=0
STARTD_ATTRS=HAS_GPU,GPU_DEV

This limits the GPU to only being advertised by the first slot. A job can use HAS_GPU to identify available slots with GPUs. The job can use GPU_DEV to identify which GPU device to use. (A job could use the presence of GPU_DEV to identify slots with GPUs instead of HAS_GPU, but "HAS_GPU" is a bit easier to read than "(GPU_DEV=!=UNDEFINED)".)
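As a rough sketch of how a job might act on GPU_DEV, here is a hypothetical Python wrapper. It assumes the submit file passes the device index as the first argument (for example via $$(GPU_DEV)) followed by the real command to run. It sets CUDA_VISIBLE_DEVICES, the environment variable the CUDA runtime uses to restrict which devices a process sees; an OpenCL job would need some other way to select the device.

```python
import os
import sys


def select_gpu(argv, environ):
    """Return a copy of environ limited to the GPU index given in argv[1]."""
    if len(argv) < 2:
        raise SystemExit("usage: wrapper.py GPU_DEV [command ...]")
    device = int(argv[1])  # raises ValueError if the index is not an integer
    env = dict(environ)
    # The CUDA runtime only exposes the devices listed in this variable.
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return env


if __name__ == "__main__" and len(sys.argv) > 2:
    # Replace this process with the real job, using the restricted environment.
    command = sys.argv[2:]
    os.execvpe(command[0], command, select_gpu(sys.argv, os.environ))
```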

If you have two GPUs, you might give the first two slots a GPU each.

SLOT1_HAS_GPU=TRUE
SLOT1_GPU_DEV=0
SLOT2_HAS_GPU=TRUE
SLOT2_GPU_DEV=1
STARTD_ATTRS=HAS_GPU,GPU_DEV

You can also provide more information about your GPUs so that a job can distinguish between different GPUs:

SLOT1_GPU_CUDA_DRV=3.20
SLOT1_GPU_CUDA_RUN=3.20
SLOT1_GPU_DEV=0
SLOT1_GPU_NAME="Tesla C2050"
SLOT1_GPU_CAPABILITY=2.0
SLOT1_GPU_GLOBALMEM_MB=2687
SLOT1_GPU_MULTIPROC=14
SLOT1_GPU_NUMCORES=32
SLOT1_GPU_CLOCK_GHZ=1.15
STARTD_ATTRS = GPU_DEV, GPU_NAME, GPU_CAPABILITY, GPU_GLOBALMEM_MB, \
  GPU_MULTIPROC, GPU_NUMCORES, GPU_CLOCK_GHZ, GPU_CUDA_DRV, \
  GPU_CUDA_RUN

(The above is from Carsten Aulbert's post "RFC: Adding GPUs into Condor".)

Automatic configuration

You can write a program to write your configuration file. This still uses STARTD_ATTRS, but potentially scales better for mixed pools. For an extended example, see Carsten Aulbert's post "RFC: Adding GPUs into Condor" in which he does exactly this.
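A minimal sketch of such a generator in Python might look like the following. It is not the script from the post mentioned above. It assumes you already have a list describing each GPU (how you detect them is up to you), gives one GPU to each slot starting at slot 1, and emits the same style of SLOTn_ attributes and STARTD_ATTRS line used in the static examples. The function name render_gpu_config is made up for illustration.

```python
def render_gpu_config(gpus):
    """Render Condor config text that assigns one GPU per slot.

    gpus: list of dicts mapping extra attribute names (e.g. "GPU_NAME")
    to values; list position is used as the device index.
    """
    lines = []
    attrs = ["HAS_GPU", "GPU_DEV"]
    for index, gpu in enumerate(gpus):
        slot = index + 1  # slots are numbered from 1, devices from 0
        lines.append(f"SLOT{slot}_HAS_GPU=TRUE")
        lines.append(f"SLOT{slot}_GPU_DEV={index}")
        for name, value in gpu.items():
            if isinstance(value, str):
                value = '"' + value + '"'  # string values need double quotes
            lines.append(f"SLOT{slot}_{name}={value}")
            if name not in attrs:
                attrs.append(name)
    # Note: this sets STARTD_ATTRS outright, as the static examples do.
    lines.append("STARTD_ATTRS=" + ",".join(attrs))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical detection result for a node with two identical GPUs.
    detected = [
        {"GPU_NAME": "Tesla C2050", "GPU_GLOBALMEM_MB": 2687},
        {"GPU_NAME": "Tesla C2050", "GPU_GLOBALMEM_MB": 2687},
    ]
    print(render_gpu_config(detected), end="")
```

You would write the output to a local configuration file on each node and reconfigure Condor so the new attributes are picked up.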

Dynamic advertising

One step beyond automatic configuration is dynamic advertising. Instead of relying on a static or automatically generated configuration, Condor itself can run your program and incorporate the information. This is Condor's "Daemon ClassAd Hooks" functionality, previously known as HawkEye and Condor Cron. This is the route taken by the condorgpu project. (Note that the condorgpu project has no affiliation with Condor. We have not tested or reviewed that code and cannot promise anything about it!)
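To give a feel for what such a program looks like, here is a hedged Python sketch of a hook script. It is not taken from condorgpu. A hook program reports attributes by printing "Name = value" lines on standard output. The detect_gpu function is a hypothetical placeholder that returns fixed values, and string values are assumed not to contain double quotes. Registering the script with the startd, and limiting its output to a single slot, is done in the Condor configuration; check the manual for your version for the exact settings.

```python
def classad_lines(attrs):
    """Format a dict of attributes as 'Name = value' ClassAd lines."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):  # check bool before other types
            text = "True" if value else "False"
        elif isinstance(value, str):
            text = '"' + value + '"'  # assumes no embedded double quotes
        else:
            text = str(value)
        lines.append(f"{name} = {text}")
    return lines


def detect_gpu():
    """Hypothetical placeholder: replace with a real query of the hardware."""
    return {"HAS_GPU": True, "GPU_DEV": 0, "GPU_NAME": "Tesla C2050"}


if __name__ == "__main__":
    # Condor reads these lines from the program's standard output.
    print("\n".join(classad_lines(detect_gpu())))
```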

The Future

The Condor team is working on various improvements in how Condor can manage GPUs. If you have TODO: condor-admin/condor-users, link to ticket when made public.

Credits

Several examples were drawn from Carsten Aulbert's post "RFC: Adding GPUs into Condor" sent to the condor-users mailing list on March 25th, 2011.