Configure GPU virtualization

Available since MOSK 24.1 TechPreview

This section delves into virtual GPU configuration. It is specifically tailored for NVIDIA physical GPUs, with a focus on the A100 40GB GPU and NVIDIA AIE 4.1 drivers.

While setup procedures may vary among different cards and vendors, MOSK can generally ensure compatibility between the MOSK Compute service (Nova) and vGPU functionality, as long as the drivers for the physical GPU expose a VFIO mdev-compatible interface to the Linux host.

For configuration specifics of other physical GPUs, refer to the official documentation provided by the vendor.

Obtain drivers

Visit NVIDIA AI Enterprise documentation for comprehensive guidance on how to download the required drivers.

Also, if you have access to the NVIDIA NGC Catalog, search it for the latest Infra Release that provides the NVIDIA vGPU Host Driver.

NVIDIA licensing

To fully utilize the capabilities of NVIDIA GPU virtualization, you may need to set up and configure the NVIDIA licensing server.

Install drivers

To install the acquired drivers within your cluster, add a custom postDeployScript script to the custom BareMetalHostProfile object used for the compute nodes with GPUs.

Note

For the instruction on how to create a custom host profile, refer to Mirantis Container Cloud: Create a custom host profile.
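
For reference, the postDeployScript script is embedded as a multi-line string in the custom BareMetalHostProfile object. The following is a minimal sketch of where the script goes; the metadata values are placeholders, and you should verify the exact schema against the Container Cloud documentation linked above:

apiVersion: kaas.mirantis.com/v1alpha1
kind: BareMetalHostProfile
metadata:
  name: compute-gpu-profile     # placeholder name
  namespace: <project-namespace>
spec:
  # ... disk layout and other profile settings ...
  postDeployScript: |
    #!/bin/bash -ex
    # The script from the example that follows goes here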

This script must accomplish the following tasks:

  • Download and install the drivers, if needed

  • Configure physical GPU according to your cluster requirements

  • Configure a startup task to reconfigure the physical GPU after node reboots

Example postDeployScript script:

#!/bin/bash -ex
# Create a one time script that will initialize physical GPU right now and self-destruct
cat << EOF > /root/test_postdeploy_job.sh
#!/bin/bash -ex
systemctl enable initialize-vgpu
systemctl start --no-block initialize-vgpu
crontab -l | grep -v test_postdeploy_job.sh | crontab -
rm /root/test_postdeploy_job.sh
EOF
mkdir -p /var/spool/cron/crontabs/ && echo "*/1 * * * * sudo /root/test_postdeploy_job.sh >> /var/log/test_postdeploy_job.log 2>&1" >> /var/spool/cron/crontabs/root
chmod +x /root/test_postdeploy_job.sh

# Create a systemd unit that will re-initialize physical GPU on restart
cat << EOF > /etc/systemd/system/initialize-vgpu.service
[Unit]
Description=Configure VGPU
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/root/initialize_vgpu.sh
RemainAfterExit=true
StandardOutput=journal
[Install]
RequiredBy=multi-user.target
EOF
cat << EOF > /root/initialize_vgpu.sh
#!/bin/bash
set -ex
while ! docker inspect ucp-kubelet >/dev/null 2>&1;
    do echo "Waiting for the lcm-agent to finish.";
    sleep 1;
done
# Download and install the driver, dependencies and tools
if [[ ! -f /root/gpu-driver-x-y-z.deb ]]; then
    apt update
    apt install -y dkms unzip gcc libc-dev make linux-headers-$(uname -r) pciutils lshw mdevctl
    wget https://my.intra.net//root/gpu-driver-x-y-z.deb -O /root/gpu-driver-x-y-z.deb
    apt install /root/gpu-driver-x-y-z.deb
    systemctl enable initialize-vgpu
fi
systemctl restart nvidia-vgpud.service
# Enable SR-IOV mode for the pGPU
/usr/lib/nvidia/sriov-manage -e <PCI-ADDRESS-OF-NVIDIA-CARD>
# Enable MIG mode for pGPU
nvidia-smi -i 0 -mig 1
systemctl enable nvidia-vgpu-mgr.service
systemctl start nvidia-vgpu-mgr.service
EOF
chmod +x /root/initialize_vgpu.sh
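
After the node is deployed and rebooted, you can verify that the startup task has re-applied the physical GPU configuration. For example, assuming the unit name and GPU index from the example above:

systemctl status initialize-vgpu.service
nvidia-smi -q -i 0 | grep -A 2 "MIG Mode"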

Manage virtual GPU types

Virtual GPU types are similar to compute flavors as they determine the resources allocated to each virtual GPU. This allows for efficient allocation and optimization of GPU resources in virtualized environments.

Each physical GPU has a maximum number of virtual GPUs of a specific type that can be created on it, with no possibility for overallocation. In the time-sliced vGPU configuration, each physical GPU can only instantiate vGPUs of a single selected type. In the Multi-Instance GPU (MIG) configuration, a single physical GPU may be partitioned into several differently sized virtual GPUs.

Either way, Mirantis recommends determining the virtual GPU types that each of your physical GPUs will provide before accepting workloads. Altering these settings afterward requires terminating every virtual machine currently running on the physical GPU that you intend to reconfigure or repurpose for another virtual GPU type.

Partition to Multi-Instance GPUs

This section outlines the process for partitioning physical GPUs into Multi-Instance GPUs (MIG) using the nvidia-smi tool provided by the NVIDIA Host GPU driver.

To list available virtual GPU instance profiles:

nvidia-smi mig -lgip

Example system response:

+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.5gb        19     7/7        4.75       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.5gb+me     20     1/1        4.75       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb       15     4/4        9.75       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.10gb       14     3/3        9.75       No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.20gb        9     2/2        19.62      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.20gb        5     1/1        19.62      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.40gb        0     1/1        39.50      No     98     5     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+

To create seven MIG vGPUs of the smallest size, which is the maximum possible number of instances according to the system response above:

nvidia-smi mig -cgi 19,19,19,19,19,19,19

To create three differently sized vGPUs of 4g.20gb, 2g.10gb, and 1g.5gb sizes:

nvidia-smi mig -cgi 5,14,19

Caution

Keep in mind that not all combinations of differently sized vGPU instances are supported. Additionally, the order in which you create vGPUs is important.

For example configurations, see NVIDIA documentation.
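
To verify which GPU instances have actually been created on the physical GPU, you can list them with the same tool, for example:

nvidia-smi mig -lgi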

Find mdev class of virtual GPU type

To correctly configure the MOSK Compute service, you need to correlate the following naming schemes related to virtual GPU types:

  • The GPU instance profile as reported by nvidia-smi mig. For example, MIG 1g.5gb.

  • The vGPU type as reported by the driver. For example, GRID A100-1-5C.

  • The mdev class that corresponds to the vGPU type. For example, nvidia-474.

For the compatibility between GPU instance profiles and virtual GPU types, refer to NVIDIA documentation: Virtual GPU Types for Supported GPUs.

To determine the mdev class supported by a specific virtual GPU type listed by a PCI device address, verify the output of the mdevctl types command executed on the compute node that has a physical GPU available on it:

mdevctl types

Example system response for MIGs:

0000:42:00.4
  nvidia-1053
    Available instances: 0
    Device API: vfio-pci
    Name: GRID A100-1-10C
    Description: num_heads=1, frl_config=60, framebuffer=10240M, max_resolution=4096x2400, max_instance=4
  ...
  nvidia-474
    Available instances: 1
    Device API: vfio-pci
    Name: GRID A100-1-5C
    Description: num_heads=1, frl_config=60, framebuffer=5120M, max_resolution=4096x2400, max_instance=7
  ...

The Name field from the example system output above corresponds to the supported virtual GPU type, linking the GPU instance profile with the mdev class supported by your physical GPU.

In the example above, the MIG 1g.5gb GPU instance profile corresponds to the GRID A100-1-5C vGPU type as per NVIDIA documentation, and according to the mdevctl types output, it corresponds to the nvidia-474 mdev class.

Note

Notice that Available instances is zero for vGPU types that are not actually supported by this given card and configuration. For MIGs, the Available instances will be non-zero only for the virtual GPU types for which the MIG virtual GPU instances have already been created. See Partition to Multi-Instance GPUs.
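
When a compute node exposes many mdev types, you can narrow the mdevctl types output down to the relevant fields. For example, the following illustrative filter keeps only the PCI addresses, mdev classes, instance counts, and vGPU type names:

mdevctl types | grep -E "^[0-9a-f]{4}:|nvidia-|Available instances|Name:"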

Configure the Compute service

The parameters you need to define for the nova-compute service on each compute node with physical GPUs you want to expose as virtual GPUs include:

  • [devices]enabled_mdev_types

    Required. The list of mdev classes. For details, see Find mdev class of virtual GPU type.

  • [devices]cleanup_mdev_devices

    Optional. By default, the Compute service does not delete created mdev devices but reuses them instead. While this speeds up processes, it may pose challenges when reconfiguring the enabled_mdev_types parameter. Set cleanup_mdev_devices to True for the Compute service to auto-delete created mdev devices upon instance deletion.

Time-sliced vGPU

If you plan to use only time-sliced vGPUs and provide a single virtual GPU type across the entire cloud, you only need to configure the options mentioned above once, globally for all compute nodes, through the spec.services section of the OpenStackDeployment custom resource.

With the configuration below, the Compute service will auto-detect all PCI devices that provide this mdev type and automatically create required resource providers in the placement service with the resource class VGPU.

Example configuration for the nvidia-474 mdev type:

kind: OpenStackDeployment
spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              devices:
                enabled_mdev_types: nvidia-474
                cleanup_mdev_devices: true

If you plan to provide multiple time-sliced vGPU types, simplify the configuration by grouping the nodes based on a node label (not necessarily host aggregates). Ensure that each group exposes only one mdev type using the Node-specific configuration settings. Additionally, use custom resource classes to facilitate flavor creation, ensuring consistent use of the CUSTOM_ prefix for the custom mdev_class values.

For example, if you want to provide the nvidia-474 and nvidia-475 mdev types, label your nodes with the vgpu-type=nvidia-474 and vgpu-type=nvidia-475 labels and use the following node-specific settings:

kind: OpenStackDeployment
spec:
  nodes:
    vgpu-type::nvidia-474:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-474
                      cleanup_mdev_devices: true
                    mdev_nvidia-474:
                      mdev_class: CUSTOM_VGPU_A100_1_5C
    vgpu-type::nvidia-475:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-475
                      cleanup_mdev_devices: true
                    mdev_nvidia-475:
                      mdev_class: CUSTOM_VGPU_A100_2_10C

The configuration above creates corresponding resource providers in the placement service that provide CUSTOM_VGPU_A100_1_5C or CUSTOM_VGPU_A100_2_10C resources. You can use these resources during the definition of flavors for instances with corresponding vGPU types.

In some cases, you may need to provide different vGPU types from a single compute node, for example, if the compute node has two physical GPUs and you want to create a different vGPU type on each of them. For such scenarios, provide explicit PCI device addresses of these physical GPUs in the settings. This makes the configuration verbose in heterogeneous hardware environments where physical GPUs have different PCI addresses on each node. For example, when targeting node-specific settings by node name:

kind: OpenStackDeployment
spec:
  nodes:
    kubernetes.io/hostname::kaas-node-7af9aab1-596d-4ba3-a717-846653aa441a:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-474,nvidia-475
                      cleanup_mdev_devices: true
                    mdev_nvidia-474:
                      device_addresses: 0000:42:00.0
                      mdev_class: CUSTOM_VGPU_A100_1_5C
                    mdev_nvidia-475:
                      device_addresses: 0000:43:00.0
                      mdev_class: CUSTOM_VGPU_A100_2_10C

Multi-Instance GPU (MIG)

In the SR-IOV mode, the driver typically creates more virtual functions than the maximum capacity of the physical GPU, even for the smallest virtual GPU type. Each virtual function can hold only a single virtual GPU. This leads to resource over-reporting to the placement service.
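
To see which PCI addresses the virtual functions have received, inspect sysfs on the compute node. A minimal sketch, assuming the physical GPU is at the 0000:42:00.0 PCI address used in the examples below:

# Print the PCI address of every virtual function of the physical GPU
for vf in /sys/bus/pci/devices/0000:42:00.0/virtfn*; do
    basename "$(readlink "$vf")"
done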

Therefore, to ensure efficient resource allocation and utilization within a homogeneous hardware environment, assuming that each compute node in it has the same PCI address for the physical GPU and that the physical GPU has been partitioned into MIG GPU instances identically:

  1. Identify the number of instances created for each MIG profile.

  2. Select random but not overlapping sets of PCI addresses from the list of virtual functions of the physical GPU. The number of addresses in each set must correspond to the number of instances created for each MIG profile.

  3. Assign the mdev type to the selected devices.

For example, for the environment with the following configuration:

  • 3 MIG instances of the MIG 1g.5gb profile and 2 MIG instances of the MIG 2g.10gb profile

  • 16 virtual functions created for the physical GPU with the PCI address range from 0000:42:00.0 to 0000:42:01.7

Pick 3 and 2 random PCI addresses from that pool and assign them to CUSTOM_VGPU_A100_1_5C and CUSTOM_VGPU_A100_2_10C mdev classes respectively:

spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              devices:
                enabled_mdev_types: nvidia-474,nvidia-475
                cleanup_mdev_devices: true
              mdev_nvidia-474:
                device_addresses: 0000:42:00.0,0000:42:00.1,0000:42:00.2
                mdev_class: CUSTOM_VGPU_A100_1_5C
              mdev_nvidia-475:
                device_addresses: 0000:42:01.0,0000:42:01.1
                mdev_class: CUSTOM_VGPU_A100_2_10C

In a heterogeneous hardware environment, use node-specific settings to group nodes with the same PCI addresses and intended vGPU configuration, or target node-specific settings to each node explicitly, one node at a time if needed.

Verify resource providers

This section provides guidelines for verifying that virtual GPUs are correctly accounted for in the OpenStack Placement service, ensuring proper scheduling of instances that utilize virtual GPUs.

Firstly, verify that resource providers have been created with accurate inventories. For each PCI device associated with a virtual GPU, including virtual instances in the case of MIG/SR-IOV, there should be a nested resource provider under the resource provider of the corresponding compute node. The name of this nested resource provider should follow the format <node-name>_pci_<pci-address-with-underscores>:

openstack resource provider list --resource CUSTOM_VGPU_A100_1_5C=1 -f yaml

Example system response:

- generation: 1
  name: kaas-node-9d18b7c8-7ea8-4b13-abe9-0e76ee8db596.kaas-kubernetes-294cbb1cbf084789b931ebc54d3f9b05_pci_0000_42_00_4
  parent_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
  root_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
  uuid: 963bb3ce-3ed1-421f-a186-a808c3460c48
  ...

Also, examine the inventory of each resource provider. It should exclusively consist of the VGPU resource or any custom resource name configured in the Compute service settings. The total capacity of the resource should match the capacity reported by the mdevctl types output, reflecting the capabilities of the PCI device for the specified mdev class. In the case of MIG, this total capacity will always be 1.

openstack resource provider inventory list 963bb3ce-3ed1-421f-a186-a808c3460c48 -f yaml

Example system response:

- allocation_ratio: 1.0
  max_unit: 1
  min_unit: 1
  reserved: 0
  resource_class: CUSTOM_VGPU_A100_1_5C
  step_size: 1
  total: 1
  used: 0
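
Additionally, you can verify that the placement service is able to serve allocation requests for the configured resource class, for example:

openstack allocation candidate list --resource CUSTOM_VGPU_A100_1_5C=1 -f yaml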

Create required resources

This section provides instructions for creating a flavor that requests a specific virtual GPU resource, using the mdev classes configured in the Compute service and registered in the placement service.

To create the flavor, use the openstack flavor create command. Ensure that the flavor properties match the configured mdev classes in the Compute service. For example, to request one vGPU of type nvidia-474 using the resource class from the previous examples:

openstack flavor create --ram 1024 --vcpus 2 --disk 5 --property resources:CUSTOM_VGPU_A100_1_5C=1 <flavor-name>

Replace the --property resources:CUSTOM_VGPU_A100_1_5C=1 parameter with the appropriate property matching the desired virtual GPU type and quantity.

Once the flavor is created, you can start launching instances using the created flavor as usual.
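
For example, assuming the flavor created above is named <flavor-name> and substituting your own image and network:

openstack server create --flavor <flavor-name> --image <image> --network <network> <instance-name>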