Configure GPU virtualization¶
Available since MOSK 24.1 TechPreview
This section delves into virtual GPU configuration. It is specifically tailored for NVIDIA physical GPUs, with a focus on the A100 40GB GPU and NVIDIA AIE 4.1 drivers.
While setup procedures may vary among different cards and vendors, MOSK can generally ensure compatibility between the MOSK Compute service (Nova) and vGPU functionality, as long as the drivers for the physical GPU expose a VFIO mdev-compatible interface to the Linux host.
For configuration specifics of other physical GPUs, refer to the official documentation provided by the vendor.
Obtain drivers¶
Visit NVIDIA AI Enterprise documentation for comprehensive guidance on how to download the required drivers.
Also, if you have access to the NVIDIA NGC Catalog, search it for the latest Infra Release that provides the NVIDIA vGPU Host Driver.
NVIDIA licensing
To fully utilize the capabilities of NVIDIA GPU virtualization, you may need to set up and configure the NVIDIA licensing server.
Install drivers¶
To install the acquired drivers within your cluster, add a custom postDeployScript script to the custom BareMetalHostProfile object used for the compute nodes with GPUs.
Note
For instructions on how to create a BareMetalHostProfile object, refer to Operations Guide: Create a custom host profile.
This script must accomplish the following tasks:
Download and install the drivers, if needed
Configure the physical GPU according to your cluster requirements
Configure a startup task to reconfigure the physical GPU after node reboots.
Example postDeployScript script:
#!/bin/bash -ex
# Create a one-time script that initializes the physical GPU right now and then self-destructs
cat << EOF > /root/test_postdeploy_job.sh
#!/bin/bash -ex
systemctl enable initialize-vgpu
systemctl start --no-block initialize-vgpu
crontab -l | grep -v test_postdeploy_job.sh | crontab -
rm /root/test_postdeploy_job.sh
EOF
mkdir -p /var/spool/cron/crontabs/ && echo "*/1 * * * * sudo /root/test_postdeploy_job.sh >> /var/log/test_postdeploy_job.log 2>&1" >> /var/spool/cron/crontabs/root
chmod +x /root/test_postdeploy_job.sh
# Create a systemd unit that will re-initialize physical GPU on restart
cat << EOF > /etc/systemd/system/initialize-vgpu.service
[Unit]
Description=Configure VGPU
After=systemd-modules-load.service
[Service]
Type=oneshot
ExecStart=/root/initialize_vgpu.sh
RemainAfterExit=true
StandardOutput=journal
[Install]
RequiredBy=multi-user.target
EOF
# Quote the delimiter so that $(uname -r) expands when the script runs, not when it is written
cat << 'EOF' > /root/initialize_vgpu.sh
#!/bin/bash
set -ex
while ! docker inspect ucp-kubelet > /dev/null 2>&1;
do echo "Waiting for the lcm-agent to finish.";
sleep 1;
done
# Download and install the driver, dependencies and tools
if [[ ! -f /root/nvidia-vgpu-ubuntu-aie-535_535.129.03_amd64.deb ]]; then
apt update
apt install -y dkms unzip gcc libc-dev make linux-headers-$(uname -r) pciutils lshw mdevctl
wget https://my.intra.net/root/gpu-driver-x-y-z.deb -O /root/gpu-driver-x-y-z.deb
apt install -y /root/gpu-driver-x-y-z.deb
systemctl enable initialize-vgpu
fi
systemctl restart nvidia-vgpud.service
# Enable SR-IOV mode for the pGPU
/usr/lib/nvidia/sriov-manage -e <PCI-ADDRESS-OF-NVIDIA-CARD>
# Enable MIG mode for pGPU
nvidia-smi -i 0 -mig 1
systemctl enable nvidia-vgpu-mgr.service
systemctl start nvidia-vgpu-mgr.service
EOF
chmod +x /root/initialize_vgpu.sh
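For reference, below is a minimal sketch of how the script can be embedded into the host profile. It assumes that the postDeployScript string field is available under spec as described in Operations Guide: Create a custom host profile; the profile name is illustrative and the script body is abbreviated:
kind: BareMetalHostProfile
metadata:
  name: compute-gpu-profile
spec:
  # ... disk layout and other host profile settings ...
  postDeployScript: |
    #!/bin/bash -ex
    # Insert the full example script from above here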
Manage virtual GPU types¶
Virtual GPU types are similar to compute flavors as they determine the resources allocated to each virtual GPU. This allows for efficient allocation and optimization of GPU resources in virtualized environments.
Each physical GPU has a maximum number of virtual GPUs of a specific type that can be created on it, with no possibility for overallocation. In the time-sliced vGPU configuration, each particular physical GPU can only instantiate vGPUs of the same selected type. In the Multi-Instance GPU (MIG) configuration, a single physical GPU may be partitioned into several differently sized virtual GPUs.
Either way, prior to accepting workloads, Mirantis recommends determining the virtual GPU types that each of your physical GPUs will provide. Altering these settings afterward requires terminating every virtual machine currently running on the physical GPU that you intend to reconfigure or repurpose for another virtual GPU type.
Partition to Multi-Instance GPUs¶
This section outlines the process for partitioning physical GPUs into Multi-Instance GPUs (MIG) using the nvidia-smi tool provided by the NVIDIA Host GPU driver.
To list available virtual GPU instance profiles:
nvidia-smi mig -lgip
Example system response:
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.5gb 19 7/7 4.75 No 14 0 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.5gb+me 20 1/1 4.75 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.10gb 15 4/4 9.75 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.10gb 14 3/3 9.75 No 28 1 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 3g.20gb 9 2/2 19.62 No 42 2 0 |
| 3 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.20gb 5 1/1 19.62 No 56 2 0 |
| 4 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 7g.40gb 0 1/1 39.50 No 98 5 0 |
| 7 1 1 |
+-----------------------------------------------------------------------------+
To create seven MIG vGPUs of the smallest size, which is the maximum possible number of instances according to the system response above:
nvidia-smi mig -cgi 19,19,19,19,19,19,19
To create three differently sized vGPUs of the 4g.20gb, 2g.10gb, and 1g.5gb sizes:
nvidia-smi mig -cgi 5,14,19
Caution
Keep in mind that not all combinations of differently sized vGPU instances are supported. Additionally, the order in which you create vGPUs is important.
For example configurations, see NVIDIA documentation.
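To double-check the result of the partitioning, list the GPU instances that have been created on the physical GPU:
nvidia-smi mig -lgi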
Find mdev class of virtual GPU type¶
To correctly configure the MOSK Compute service, you need to correlate the following naming schemes related to virtual GPU types:
The GPU instance profile as reported by nvidia-smi mig. For example, MIG 1g.5gb.
The vGPU type as reported by the driver. For example, GRID A100-1-5C.
The mdev class that corresponds to the vGPU type. For example, nvidia-474.
For the compatibility between GPU instance profiles and virtual GPU types, refer to NVIDIA documentation: Virtual GPU Types for Supported GPUs.
To determine the mdev class that corresponds to a specific virtual GPU type, as listed per PCI device address, inspect the output of the mdevctl types command executed on the compute node that has a physical GPU available on it:
mdevctl types
Example system response for MIGs:
0000:42:00.4
nvidia-1053
Available instances: 0
Device API: vfio-pci
Name: GRID A100-1-10C
Description: num_heads=1, frl_config=60, framebuffer=10240M, max_resolution=4096x2400, max_instance=4
...
nvidia-474
Available instances: 1
Device API: vfio-pci
Name: GRID A100-1-5C
Description: num_heads=1, frl_config=60, framebuffer=5120M, max_resolution=4096x2400, max_instance=7
...
The Name field from the example system output above corresponds to the supported virtual GPU type, linking the GPU instance profile with the mdev class supported by your physical GPU.
In the example above, the MIG 1g.5gb GPU instance profile corresponds to the GRID A100-1-5C vGPU type as per NVIDIA documentation, and according to the mdevctl types output, it corresponds to the nvidia-474 mdev class.
Note
Notice that Available instances is zero for the vGPU types that are not actually supported by the given card and configuration. For MIGs, Available instances will be non-zero only for the virtual GPU types for which MIG virtual GPU instances have already been created. See Partition to Multi-Instance GPUs.
Configure the Compute service¶
The parameters you need to define for the nova-compute service on each compute node with physical GPUs that you want to expose as virtual GPUs include:
[devices]enabled_mdev_types
Required. The list of mdev classes, see the previous step for details.
[devices]cleanup_mdev_devices
Optional. By default, the Compute service does not delete created mdev devices but reuses them instead. While this speeds up processes, it may pose challenges when reconfiguring the enabled_mdev_types parameter. Set cleanup_mdev_devices to True for the Compute service to auto-delete created mdev devices upon instance deletion.
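For reference, these parameters end up in the [devices] section of the nova-compute configuration file. A minimal sketch of the resulting fragment, assuming the nvidia-474 mdev type used throughout this section; in MOSK, the file is rendered from the OpenStackDeployment custom resource as shown below:
[devices]
enabled_mdev_types = nvidia-474
cleanup_mdev_devices = True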
Time-sliced vGPU¶
If you plan to use only time-sliced vGPUs and provide a single virtual GPU type across the entire cloud, you only need to configure the options mentioned above once, globally for all compute nodes, through the spec.services section of the OpenStackDeployment custom resource.
With the configuration below, the Compute service auto-detects all PCI devices that provide this mdev type and automatically creates the required resource providers in the placement service with the VGPU resource class.
Example configuration for the nvidia-474 mdev type:
kind: OpenStackDeployment
spec:
services:
compute:
nova:
values:
conf:
nova:
devices:
enabled_mdev_types: nvidia-474
cleanup_mdev_devices: true
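With such a global single-type configuration, flavors can request the default VGPU resource class directly. A minimal sketch, where the flavor name and sizing are illustrative:
openstack flavor create --ram 4096 --vcpus 2 --disk 20 --property resources:VGPU=1 gpu-small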
If you plan to provide multiple time-sliced vGPU types, simplify the configuration by grouping the nodes based on a node label (not necessarily aggregates). Ensure that each group exposes only one mdev type using the Node-specific configuration settings. Additionally, use custom resource classes to facilitate flavor creation, ensuring consistent use of the CUSTOM_ prefix for custom mdev_class values.
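The way you assign the label depends on how the cluster is managed. A minimal sketch, assuming the label can be applied directly to the Kubernetes node; in Container Cloud-managed clusters, node labels are typically set through the nodeLabels field of the corresponding Machine object:
kubectl label node <node-name> vgpu-type=<mdev-type>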
For example, if you want to provide the nvidia-474 and nvidia-475 mdev types, label your nodes with the vgpu-type=nvidia-474 and vgpu-type=nvidia-475 labels and use the following node-specific settings:
kind: OpenStackDeployment
spec:
nodes:
vgpu-type::nvidia-474:
services:
compute:
nova:
nova_compute:
values:
conf:
nova:
devices:
enabled_mdev_types: nvidia-474
cleanup_mdev_devices: true
mdev_nvidia-474:
mdev_class: CUSTOM_VGPU_A100_1_5C
vgpu-type::nvidia-475:
services:
compute:
nova:
nova_compute:
values:
conf:
nova:
devices:
enabled_mdev_types: nvidia-475
cleanup_mdev_devices: true
mdev_nvidia-475:
mdev_class: CUSTOM_VGPU_A100_2_10C
The configuration above creates corresponding resource providers in the placement service that provide the CUSTOM_VGPU_A100_1_5C or CUSTOM_VGPU_A100_2_10C resources. You can use these resources when defining flavors for instances with the corresponding vGPU types.
In some cases, you may need to provide different vGPU types from a single compute node, for example, when the compute node has two physical GPUs and you want to create a different vGPU type on each of them. For such scenarios, provide the explicit PCI device addresses of these physical GPUs in the settings. This makes the configuration verbose in heterogeneous hardware environments where physical GPUs have different PCI addresses on each node. For example, when targeting node-specific settings by node name:
kind: OpenStackDeployment
spec:
nodes:
kubernetes.io/hostname::kaas-node-7af9aab1-596d-4ba3-a717-846653aa441a:
services:
compute:
nova:
nova_compute:
values:
conf:
nova:
devices:
enabled_mdev_types: nvidia-474,nvidia-475
cleanup_mdev_devices: true
mdev_nvidia-474:
device_addresses: 0000:42:00.0
mdev_class: CUSTOM_VGPU_A100_1_5C
mdev_nvidia-475:
device_addresses: 0000:43:00.0
mdev_class: CUSTOM_VGPU_A100_2_10C
Multi-Instance GPU (MIG)¶
In the SR-IOV mode, the driver typically creates more virtual functions than the maximum number of virtual GPUs the physical GPU can accommodate, even for the smallest virtual GPU type. Each virtual function can hold only a single virtual GPU. This leads to resource over-reporting to the placement service.
Therefore, to ensure efficient resource allocation and utilization within a homogeneous hardware environment, assuming that each compute node in it has the same PCI address for the physical GPU and the physical GPU has been partitioned into MIG GPU instances identically:
Identify the number of instances created of each MIG profile.
Select random but not overlapping sets of PCI addresses from the list of virtual functions of the physical GPU. The number of addresses in each set must correspond to the number of instances created of each MIG profile.
Assign the mdev type to the selected devices.
For example, consider an environment with the following configuration:
3 MIG instances of MIG 1g.5gb and 2 MIG instances of MIG 2g.10gb
16 virtual functions created for the physical GPU with the PCI address range from 0000:42:00.0 to 0000:42:01.7
Pick 3 and 2 random PCI addresses from that pool and assign them to the CUSTOM_VGPU_A100_1_5C and CUSTOM_VGPU_A100_2_10C mdev classes respectively:
spec:
services:
compute:
nova:
values:
conf:
nova:
devices:
enabled_mdev_types: nvidia-474,nvidia-475
cleanup_mdev_devices: true
mdev_nvidia-474:
device_addresses: 0000:42:00.0,0000:42:00.1,0000:42:00.2
mdev_class: CUSTOM_VGPU_A100_1_5C
mdev_nvidia-475:
device_addresses: 0000:42:01.0,0000:42:01.1
mdev_class: CUSTOM_VGPU_A100_2_10C
In a heterogeneous hardware environment, use node-specific settings to group nodes that have the same PCI addresses and intended vGPU configuration, or target node-specific settings at each node explicitly, one node at a time if needed.
Verify resource providers¶
This section provides guidelines for verifying that virtual GPUs are correctly accounted for in the OpenStack Placement service, ensuring proper scheduling of instances that utilize virtual GPUs.
First, verify that resource providers have been created with accurate inventories. For each PCI device associated with a virtual GPU, including virtual functions in the case of MIG/SR-IOV, there should be a nested resource provider under the resource provider of the corresponding compute node. The name of this nested resource provider should follow the <node-name>_pci_<pci-address-with-underscores> format:
openstack resource provider list --resource CUSTOM_VGPU_A100_1_5C=1 -f yaml
Example system response:
- generation: 1
name: kaas-node-9d18b7c8-7ea8-4b13-abe9-0e76ee8db596.kaas-kubernetes-294cbb1cbf084789b931ebc54d3f9b05_pci_0000_42_00_4
parent_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
root_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
uuid: 963bb3ce-3ed1-421f-a186-a808c3460c48
...
Also, examine the inventory of each resource provider. It should exclusively consist of the VGPU resource or any custom resource name configured in the Compute service settings. The total capacity of the resource should match the capacity reported in the mdevctl types output, reflecting the capabilities of the PCI device for the specified mdev class. In the case of MIG, this total capacity is always 1.
openstack resource provider inventory list 963bb3ce-3ed1-421f-a186-a808c3460c48 -f yaml
Example system response:
- allocation_ratio: 1.0
max_unit: 1
min_unit: 1
reserved: 0
resource_class: CUSTOM_VGPU_A100_1_5C
step_size: 1
total: 1
used: 0
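After an instance consuming a virtual GPU is launched (see Create required resources), the usage counter of the corresponding resource provider should increase accordingly. A sketch of this check, assuming the same provider UUID as above:
openstack resource provider usage show 963bb3ce-3ed1-421f-a186-a808c3460c48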
Create required resources¶
This section provides instructions for creating a flavor that requests a specific virtual GPU resource, using the mdev classes configured in the Compute service and registered in the placement service.
To create the flavor, use the openstack flavor create command. Ensure that the flavor properties match the mdev classes configured in the Compute service. For example, to request one vGPU of the nvidia-474 type using the resource class from the previous examples:
openstack flavor create --ram 1024 --vcpus 2 --disk 5 --property resources:CUSTOM_VGPU_A100_1_5C=1 <flavor-name>
Replace the --property resources:CUSTOM_VGPU_A100_1_5C=1 parameter with the appropriate property matching the desired virtual GPU type and quantity.
Once the flavor is created, you can start launching instances using the created flavor as usual.
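For example, assuming the flavor created above and illustrative image and network placeholders:
openstack server create --flavor <flavor-name> --image <image> --network <network> <instance-name>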