HTCondor can help manage GPUs (graphics processing units) in your pool of execute nodes, making them available to jobs that use them via an API like {link:http://www.khronos.org/opencl/ OpenCL} or {link:http://www.nvidia.com/object/cuda_home_new.html CUDA}. The techniques described on this page depend on HTCondor version 8.1.4 or later; if you are using an earlier version of HTCondor, you should upgrade. If you cannot upgrade for some reason, an older, less flexible technique is described at HowToManageGpusInSeriesSeven.

HTCondor matches execute nodes (described by {quote: ClassAds}) to jobs (also described by {quote: ClassAds}). The general technique to manage GPUs is:

1: Advertise the GPU: Configure HTCondor so that execute nodes include information about available GPUs in their ClassAd.
2: Require a GPU: Jobs modify their Requirements to require a suitable GPU.
3: Identify the GPU: Jobs modify their arguments or environment to learn which GPU they may use.

{section: Advertising the GPU}

HTCondor includes a tool, condor_gpu_discovery, designed to assist in detecting GPUs and configuring HTCondor to advertise GPU information. It detects CUDA and OpenCL devices and outputs a list of GPU identifiers for all detected devices.

HTCondor has a general mechanism for declaring user-defined slot resources. We will use this mechanism to define a resource called 'GPUs'. The resource type name 'GPUs' is case insensitive, but you must be consistent about the plural: HTCondor considers 'GPU' to be a different resource type than 'GPUs'. We recommend the use of 'GPUs' for GPU custom resources.

To define 'GPUs' as a custom resource, simply add the following statements to the configuration on your execute node:

{code}
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
{endcode}

The first line tells HTCondor to run the condor_gpu_discovery tool and use its output to define a custom resource called 'GPUs'. The second line tells HTCondor to publish the AssignedGPUs for a slot in the job's environment using the environment variables =CUDA_VISIBLE_DEVICES= and =GPU_DEVICE_ORDINAL=. If you know for certain that your devices will be CUDA, then you can omit =GPU_DEVICE_ORDINAL= in the configuration above. If you know for certain that your devices are OpenCL only, then you can omit =CUDA_VISIBLE_DEVICES=. In addition, =AssignedGPUs= will always be published into the job's environment as =_CONDOR_AssignedGPUs=, so the second line above is not strictly necessary, but it is recommended.

The output of the condor_gpu_discovery tool will report =DetectedGPUs= and list the GPU id of each one. GPU ids will be of the form =CUDA<N>= or =OCL<N>=, where =<N>= is an integer, and CUDA or OCL indicates whether the CUDA library or the OpenCL library is used to communicate with the device. The -properties argument in the command above tells condor_gpu_discovery to also list significant attributes of the device(s). These attributes will then be published in the slot ads. This is typical output of condor_gpu_discovery:

{code}
> condor_gpu_discovery -properties
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=2048
CUDARuntimeVersion=5.0
{endcode}

This output indicates that 4 GPUs were detected, all of which have the same properties.
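After restarting the condor_startd with this configuration in place, one way to confirm that the GPUs are being advertised is to query the collector for the GPU-related slot attributes. The following is a sketch; the slot name and attribute values shown are placeholders that will vary with your pool and hardware:

{code}
> condor_status -constraint 'TotalGPUs > 0' -af Name GPUs TotalGPUs CUDADeviceName
slot1@execute-node.example.com 1 4 GeForce GTX 690
{endcode}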
If you are using a static slot configuration and you wish to control how many GPUs are assigned to each slot, use the =SLOT_TYPE_<n>= configuration syntax to specify =Gpus=, the same as you would for =Cpus= or =Memory= (for example, =SLOT_TYPE_1 = Cpus=2, Gpus=2=). If you don't specify, slots default to =GPUS=auto=, which assigns GPUs proportionally to slots until there are no more GPUs to assign, and then assigns 0 GPUs to the remaining slots. So a machine with =NUM_CPUS=8= and =DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"= will assign 1 GPU each to the first 4 slots and no GPUs to the remaining slots. Slots with GPUs assigned will include the following attributes:

{code}
Cpus=1
GPUs=1
TotalCpus=8
TotalGPUs=4
TotalSlotCpus=1
TotalSlotGPUs=1
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=2048
CUDARuntimeVersion=5.0
{endcode}

If you are using a partitionable slot, then by default the partitionable slot will be assigned all of the GPUs. Dynamic slots created from this partitionable slot will be assigned GPUs when the job requests them.

{section: Require a GPU}

User jobs that require a GPU must specify this requirement. In a job's submit file, this might be as simple as

{code}
Request_GPUs = 1
{endcode}

or as complex as

{code}
Request_GPUs = 2
Requirements = CUDARuntimeVersion >= 5.5 \
    && (CUDACapability >= 3.0) \
    && (CUDAGlobalMemoryMb >= 1500)
{endcode}

specifying that the job requires two CUDA GPUs with at least 1500 MB of memory each, CUDA runtime version 5.5 or later, and a CUDA capability of 3.0 or greater.

{section: Identify the GPU}

Once a job matches to a given slot, it needs to know which GPUs to use, if multiple are present. GPUs that the job is permitted to use are specified in the slot {quote: ClassAd} as =AssignedGPUs=. They are also published into the job's environment as =_CONDOR_AssignedGPUs=, and (if the configuration at the top is used) as =CUDA_VISIBLE_DEVICES= and =GPU_DEVICE_ORDINAL=.

You can pass the assigned GPUs to your job's arguments using the $$() syntax. For example, if your job takes an argument "--device=X", where X is the device to use, you might do something like

{code}
arguments = "--device=$$(AssignedGPUs)"
{endcode}

Alternatively, your job can read the environment variable =_CONDOR_AssignedGPUs= or =CUDA_VISIBLE_DEVICES=.
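If your executable is wrapped in a script, the script can pick the device up directly from the environment. Below is a minimal sketch, assuming the configuration shown at the top of this page; =gpu_program= and its =--device= option are hypothetical stand-ins for your actual code:

{code}
#!/bin/sh
# _CONDOR_AssignedGPUs is always published into the job environment.
# CUDA_VISIBLE_DEVICES is only set when ENVIRONMENT_FOR_AssignedGPUs
# is configured as shown in the Advertising section above.
echo "Assigned GPUs: ${_CONDOR_AssignedGPUs}"
exec ./gpu_program --device="${_CONDOR_AssignedGPUs}"
{endcode}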
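Putting the three steps together, a complete submit description might look like the following sketch; again, =gpu_program= and its =--device= option are hypothetical:

{code}
universe     = vanilla
executable   = gpu_program
arguments    = "--device=$$(AssignedGPUs)"
request_GPUs = 1
requirements = (CUDACapability >= 3.0)
output       = gpu_program.out
error        = gpu_program.err
log          = gpu_program.log
queue
{endcode}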