For now, to get an A100 from AWS, you have to rent eight of them with the p4d.24xlarge
instance type. Use the "Deep Learning Base AMI (Amazon Linux 2)" to avoid having to deal with drivers and the like.
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install https://research.cs.wisc.edu/htcondor/repo/current/htcondor-release-current.amzn2.noarch.rpm
sudo yum-builddep condor
Then set up your HTCondor build tree in the usual way. (Don't forget the -j BIGNUM; this instance type has a lot of cores.)
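One way to pick BIGNUM without guessing is to ask the machine. This is a minimal sketch, assuming the build is driven by make (which the -j flag suggests) and that nproc from GNU coreutils is available on the AMI; the variable name jobs is only illustrative.

```shell
# Count the processing units available to this process (GNU coreutils).
jobs=$(nproc)

# Print the build command rather than running it, so the value can be checked first.
echo "make -j $jobs"
```

Running the printed command from the build tree uses every core the instance reports.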
NVIDIA has a MIG user guide. Of particular note:
sudo nvidia-smi -i INDEX -mig 1
enables MIG but does not create any GPU instances. Doing this step but not the next one is the (mis)configuration of relevance to HTCONDOR-476.
sudo nvidia-smi mig -i 1 -cgi 19,19,19,19,19,19,19 -C
creates a 7-way split of the A100.
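The two steps above can be sketched as a small dry-run helper, which is handy when repeating them for several of the eight GPUs. This is only a sketch: the function names mig_profile_list and mig_commands are hypothetical, the profile ID 19 and the count of 7 are taken from the command above, and the script prints the nvidia-smi commands instead of running them.

```shell
#!/bin/sh
# Build the comma-separated list passed to -cgi: COUNT copies of PROFILE.
mig_profile_list() {
    profile=$1
    count=$2
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append a comma only when the list is already non-empty.
        list="${list:+$list,}$profile"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Print (do not run) the enable and split commands for one GPU index.
mig_commands() {
    index=$1
    printf 'sudo nvidia-smi -i %s -mig 1\n' "$index"
    printf 'sudo nvidia-smi mig -i %s -cgi %s -C\n' "$index" "$(mig_profile_list 19 7)"
}

mig_commands 1
```

Stopping after the first printed command reproduces the HTCONDOR-476 (mis)configuration; running both gives the 7-way split. If in doubt about which profile IDs a device offers, nvidia-smi mig -lgip lists them.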