HTCondor can help manage the allocation of GPUs (graphics processing units) to jobs within a pool of execute nodes, treating the GPUs as resources that are assigned to jobs. Jobs can then use the GPUs through an API such as {link:http://www.khronos.org/opencl/ OpenCL} or {link:http://www.nvidia.com/object/cuda_home_new.html CUDA}.

The techniques described here require HTCondor version 8.1.4 or later. Those running older versions may try the older, less flexible technique described at HowToManageGpusInSeriesSeven.

The general technique to manage GPUs is:

1: Advertise the GPU: configure HTCondor so that execute nodes include information about available GPUs in their machine ClassAd.

2: Job requests a GPU: jobs state that they need a GPU, and specify any further requirements the GPU must satisfy, in order to be matched with a suitable one.

3: Identify the GPU: jobs examine their arguments or environment to learn which GPU(s) they may use.

{section: Advertise the GPU}

The availability of GPU resources must be advertised in the machine's ClassAd so that jobs that need GPUs can be matched with machines that have them. The _condor_gpu_discovery_ tool assists in detecting GPUs and in providing the configuration details needed to advertise GPU information. This tool detects CUDA and OpenCL devices, and outputs a list of GPU identifiers for all detected devices.

HTCondor has a general mechanism for declaring user-defined slot resources, and GPUs are advertised using this mechanism. These examples use the resource type name =GPUs=. Resource type names are case insensitive, but all characters within the name are significant, so be consistent.

Define =GPUs= as a custom resource by adding the following definitions to the configuration of the execute node:

{code}
MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
{endcode}

=MACHINE_RESOURCE_GPUs= tells HTCondor to run the _condor_gpu_discovery_ tool and use its output to define a custom resource called =GPUs=. =ENVIRONMENT_FOR_AssignedGPUs= tells HTCondor to publish the value of the slot ClassAd attribute =AssignedGPUs= into the job's environment using the environment variables =CUDA_VISIBLE_DEVICES= and =GPU_DEVICE_ORDINAL=. If you know for certain that your devices will be CUDA, you can omit =GPU_DEVICE_ORDINAL= from the configuration above; if you know for certain that your devices are OpenCL only, you can omit =CUDA_VISIBLE_DEVICES=. In addition, =AssignedGPUs= is always published into the job's environment as =_CONDOR_AssignedGPUs=, so the second line above is not strictly necessary, but it is recommended.

The output of _condor_gpu_discovery_ reports =DetectedGPUs= and lists the GPU id of each detected device. GPU ids are of the form =CUDA<N>= or =OCL<N>=, where =<N>= is an integer and the CUDA or OCL prefix indicates whether the CUDA library or the OpenCL library is used to communicate with the device. The =-properties= argument tells _condor_gpu_discovery_ to also list significant attributes of the device(s). These attributes will then be published in each slot ClassAd.
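After reconfiguring the execute node, one way to confirm that the GPU attributes are being advertised is to query a slot's ClassAd with _condor_status_. This is a minimal sketch; the slot and host names are hypothetical placeholders for a machine in your pool:

{code}
> condor_status -long slot1@gpu-node.example.com | grep -i -E "gpus|cuda"
{endcode}

The =GPUs= and =CUDA*= attributes described below should appear in the output.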
Here is typical output of _condor_gpu_discovery_:

{code}
> condor_gpu_discovery -properties
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=2048
CUDARuntimeVersion=5.0
{endcode}

This output indicates that 4 GPUs were detected, all of which have the same properties.

If using a static slot configuration, control how many GPUs are assigned to each slot with the =SLOT_TYPE_<N>= configuration syntax, the same as would be done for =Cpus= or =Memory=. If not specified, slots default to =GPUS=auto=, which assigns GPUs proportionally to slots until there are no more GPUs to assign, and then assigns 0 GPUs to the remaining slots. So a machine with =NUM_CPUS=8= and =DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"= will assign 1 GPU each to the first 4 slots, and no GPUs to the remaining slots. Slot ClassAds with GPUs assigned will include the following attributes:

{code}
Cpus=1
GPUs=1
TotalCpus=8
TotalGPUs=4
TotalSlotCpus=1
TotalSlotGPUs=1
CUDACapability=3.0
CUDADeviceName="GeForce GTX 690"
CUDADriverVersion=5.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=2048
CUDARuntimeVersion=5.0
{endcode}

With partitionable slots, the default partitionable slot will be assigned all of the GPUs. Dynamic slots created from this partitionable slot will be assigned GPUs when jobs request them.

{section: Job requests a GPU}

User jobs that require a GPU must specify this requirement. In a job's submit description file, the simplest request is

{code}
Request_GPUs = 1
{endcode}

A more complex request, such as

{code}
Request_GPUs = 2
Requirements = CUDARuntimeVersion >= 5.5 \
   && (CUDACapability >= 3.0) \
   && (CUDAGlobalMemoryMb >= 1500)
{endcode}

specifies that the job requires a CUDA GPU with at least 1500 MB of memory, CUDA runtime version 5.5 or later, and a CUDA Capability of 3.0 or greater.

{section: Identify the GPU}

Once a job matches to a given slot, it needs to know which GPUs to use, if multiple are present. The GPUs that the job is permitted to use are listed in the slot ClassAd attribute =AssignedGPUs=, and are also published into the job's environment in the variable =_CONDOR_AssignedGPUs=. In addition, if =ENVIRONMENT_FOR_AssignedGPUs= is set as in the configuration above, the environment variables =CUDA_VISIBLE_DEVICES= and =GPU_DEVICE_ORDINAL= are also published.

The =AssignedGPUs= attribute value can be passed as a job argument using the $$() substitution macro syntax. For example, if the job takes an argument =--device=X=, where X is the device to use, specify this in the submit description file with

{code}
arguments = "--device=$$(AssignedGPUs)"
{endcode}

Alternatively, the job can look at the environment variable =_CONDOR_AssignedGPUs= or =CUDA_VISIBLE_DEVICES=.
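Putting these pieces together, a complete submit description file might look like the following minimal sketch. The executable name =gpu_job= and its =--device= argument are hypothetical stand-ins for a real GPU program:

{code}
# Minimal sketch; gpu_job and its --device argument are hypothetical.
universe     = vanilla
executable   = gpu_job
# $$(AssignedGPUs) expands to the GPU id(s) assigned in the matched slot
arguments    = "--device=$$(AssignedGPUs)"
request_GPUs = 1
requirements = (CUDACapability >= 3.0)
output       = gpu_job.out
error        = gpu_job.err
log          = gpu_job.log
queue
{endcode}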