A use case came up where multiple virtual machines needed access to various video cards.
We had the following for the host:
· ESC8000 G4
· 2x Intel Xeon Gold 6140
· 4x NVIDIA Quadro RTX 6000
· 4x NVIDIA Tesla P100 (Pascal)
· Ubuntu 16.04 (Xenial)
Before installing, confirm the host BIOS supports passthrough (IOMMU / Intel VT-d) and that it is enabled.
In the BIOS setup this usually appears as an Intel VT-d or IOMMU option. And of course, make sure VMX (Intel Virtualization Technology) is enabled as well.
See if it is already enabled in the kernel
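A quick way to check is to grep the kernel boot messages (the exact wording varies by kernel version):
dmesg | grep -e DMAR -e IOMMU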
DMAR: IOMMU enabled
Prepare the host kernel for passthrough
Add intel_iommu=on to the kernel command line in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
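You can edit the file by hand; alternatively, a one-liner along these lines prepends the option to whatever is already set (back up the file first):
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&intel_iommu=on /' /etc/default/grub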
Locate the PCI bus address of the GPUs that will be passed through
ubuntu@asus_gpu:~$ lspci | grep -i nvidia
1d:00.0 VGA compatible controller: NVIDIA Corporation Device 1e30 (rev a1)
1d:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1d:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1d:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1e:00.0 VGA compatible controller: NVIDIA Corporation Device 1e30 (rev a1)
1e:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1e:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1e:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1f:00.0 VGA compatible controller: NVIDIA Corporation Device 1e30 (rev a1)
1f:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1f:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1f:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
20:00.0 VGA compatible controller: NVIDIA Corporation Device 1e30 (rev a1)
20:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
20:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
20:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
21:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)
22:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)
23:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)
24:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)
ubuntu@asus_gpu:~$
Look at the kernel drivers in use and the vendor/device codes.
For one of the Tesla P100 cards:
ubuntu@asus_gpu:~$ lspci -nn -k -s 21:00.0
21:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:118f]
Kernel driver in use: nouveau
Kernel modules: nouveau
For one of the RTX 6000 cards. Each one shows up as four PCI functions (the GPU plus audio, USB, and serial bus controllers, presumably for the USB-C port):
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.0
1d:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1e30] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: nouveau
Kernel modules: nvidiafb, nouveau
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.1
1d:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.2
1d:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: nouveau
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.3
1d:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: nouveau
ubuntu@asus_gpu:~$
Update the file /etc/initramfs-tools/modules with the following, filling in the vendor:device IDs of the devices you want to pass through (the bracketed [10de:xxxx] values from the lspci -nn output above):
vfio
vfio_iommu_type1
vfio_pci ids=10de:1e30,10de:10f7,10de:1ad6,10de:1ad7,10de:15f8
vhost-net
Update the /etc/modules file with the same entries:
vfio
vfio_iommu_type1
vfio_pci ids=10de:1e30,10de:10f7,10de:1ad6,10de:1ad7,10de:15f8
vhost-net
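Some setups also (or instead) hand the IDs to vfio-pci through a modprobe options file; a minimal sketch, assuming vfio-pci is built as a module:
echo "options vfio-pci ids=10de:1e30,10de:10f7,10de:1ad6,10de:1ad7,10de:15f8" | sudo tee /etc/modprobe.d/vfio.conf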
Because you updated the GRUB config and the initramfs modules, run the following two commands and then reboot.
sudo update-grub
sudo update-initramfs -u
Once it comes back up, check that passthrough is set up and that the vfio-pci driver is now bound to each device.
ubuntu@asus_gpu:~$ lspci -nn -k -s 21:00.0
21:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f8] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:118f]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
ubuntu@asus_gpu:~$
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.0
1d:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1e30] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.1
1d:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f7] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.2
1d:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: vfio-pci
ubuntu@asus_gpu:~$ lspci -nn -k -s 1d:00.3
1d:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad7] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:12ba]
Kernel driver in use: vfio-pci
If "Kernel driver in use:" shows something other than vfio-pci, double-check your device IDs and addresses.
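It is also worth confirming that the vfio modules actually loaded and seeing how the devices landed in IOMMU groups, since everything in a group has to be passed through together:
lsmod | grep vfio
for g in /sys/kernel/iommu_groups/*; do echo "IOMMU group ${g##*/}:"; ls "$g/devices"; done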
Building the VM
Launch an Ubuntu KVM VM as you normally would. Once it is up and running, start attaching the devices you want by updating the domain XML.
You can modify it manually using virsh edit, or attach devices from a separate XML file:
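For the manual route, virsh edit drops you into the domain XML (the VM name here is just a placeholder):
virsh edit gpu-vm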
Locate the PCI address needed for the <hostdev> entry. Note that the address element uses hex while virsh nodedev-dumpxml reports the bus in decimal (<bus>33</bus>); in my case it's bus='0x21'.
ubuntu@asus_gpu:~$ virsh nodedev-dumpxml pci_0000_21_00_0
<device>
  <name>pci_0000_21_00_0</name>
  <path>/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:08.0/0000:21:00.0</path>
  <parent>pci_0000_19_08_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>33</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x15f8' />
    <vendor id='0x10de'>NVIDIA Corporation</vendor>
    <iommuGroup number='41'>
      <address domain='0x0000' bus='0x21' slot='0x00' function='0x0'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='8' speed='8' width='16'/>
      <link validity='sta' speed='8' width='16'/>
    </pci-express>
  </capability>
</device>
ubuntu@asus_gpu:~$
Create an XML file with the following, plugging in the address information you got above:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x21' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</hostdev>
Attach it with the following command:
virsh attach-device (VM Name) --file ~/(your_xml_file) --config
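For example (the VM name and file name are just placeholders):
virsh attach-device gpu-vm --file ~/p100-hostdev.xml --config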
This can all be done manually or from the GUI using Virt Manager as well. The RTX 6000 will need all four of its PCI functions attached to work properly, as sketched below.
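For the RTX 6000 on bus 0x1d that means four <hostdev> entries, one per function; attach them one at a time with the command above, or add them all at once with virsh edit. A sketch with only the source addresses (libvirt will pick the guest-side addresses if you leave them out):
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x1'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x2'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x3'/>
  </source>
</hostdev>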
Shut down the VM and then start it back up.
Log in to the VM, run lspci | grep -i nvidia, and confirm you see the GPU.
NOTE:
Some Tesla GPUs will function without defining a CPU model for the guest; some will not.
The ones that do not will still show up in the VM and look like they should work, but they will throw seemingly arbitrary errors when you try to use the GPU.
Errors like:
clCreateContext(): CL_OUT_OF_RESOURCES
or
code=46(cudaErrorDevicesUnavailable) "cudaEventCreate(&start)"
strace will show the process reading device memory but stopping when it attempts to write.
Update the CPU definition in the domain XML and set the model to something other than the default hypervisor CPU:
<cpu mode='custom' match='exact'>
  <model fallback='allow'>Broadwell-IBRS</model>
</cpu>
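To see which named CPU models your libvirt/QEMU build supports (pick one no newer than the host CPUs):
virsh cpu-models x86_64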
Shut down and restart the VM for the changes to take effect.
Install the NVIDIA drivers and CUDA on the VM and have fun.
It's always a hassle for me to install drivers, so just paste the following if you want to do it quickly:
sudo apt update
sudo apt install wget -y
# Download the files to install
# NVIDIA 410.79 driver (local repo package). Note: as written, this script never
# installs this deb; the 'cuda' metapackage below should pull in the driver bundled
# with the CUDA repo (410.48). Install this deb with dpkg -i if you specifically want 410.79.
wget http://us.download.nvidia.com/tesla/410.79/nvidia-diag-driver-local-repo-ubuntu1604-410.79_1.0-1_amd64.deb
# NVIDIA CUDA 10.0 installer (local repo package)
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda -y
sudo apt install nvidia-cuda-toolkit -y
# reboot
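Once the VM is back up, a quick sanity check that the driver can talk to the passed-through card:
nvidia-smi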
All of this can be easily scripted; feel free to hit me up for it, but there is a good chance you might be ignored since I don't log on that often. The next chapter will cover multiple containers concurrently accessing the GPU.