GPU virtualization

Available since MOSK 24.1 TechPreview

MOSK provides GPU virtualization capabilities to its users through the NVIDIA vGPU and Multi-Instance GPU (MIG) technologies.

GPU virtualization is a capability offered by modern datacenter-grade GPUs that enables the partitioning of a single physical GPU into smaller virtual devices, which can then be attached to individual virtual machines.

In contrast to the Peripheral Component Interconnect (PCI) passthrough feature, GPU virtualization enables concurrent use of the same physical GPU device by multiple virtual machines. This improves hardware utilization and fosters a more elastic consumption of expensive hardware resources.

When using GPU virtualization, the physical device and its drivers manage computing resource partitioning and isolation.

The use case for GPU virtualization is any application that requires or benefits from accelerated parallel floating-point computation: graphics-intensive desktop workloads, for example, 3D modeling and rendering, as well as computationally intensive tasks, for example, artificial intelligence, specifically, machine learning training and classification.

At its core, GPU virtualization builds on the single-root input/output virtualization (SR-IOV) framework, which is already widely used by datacenter-grade network adapters, and on the Linux kernel mediated devices (mdev) framework.
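On a host with the vendor driver loaded, the mediated devices framework exposes the available vGPU types through sysfs. The following sketch illustrates that interface; the PCI address and the `nvidia-699` type identifier are assumptions for illustration only:

```shell
# List the mediated device types advertised by a GPU at a hypothetical
# PCI address (0000:41:00.0); paths follow the kernel mdev interface.
ls /sys/class/mdev_bus/0000:41:00.0/mdev_supported_types/

# Inspect a type's human-readable name and remaining capacity.
cat /sys/class/mdev_bus/0000:41:00.0/mdev_supported_types/nvidia-699/name
cat /sys/class/mdev_bus/0000:41:00.0/mdev_supported_types/nvidia-699/available_instances

# Create a vGPU instance of that type by writing a new UUID to "create".
uuidgen | sudo tee /sys/class/mdev_bus/0000:41:00.0/mdev_supported_types/nvidia-699/create
```

In a MOSK cluster, the Compute service (Nova) performs this mdev management for you; the commands above are only useful for manual inspection and troubleshooting.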

Hardware drivers

Typically, using GPU virtualization requires the installation of specific physical GPU drivers on the host system. For detailed instructions on obtaining and installing the required drivers, refer to official documentation from the vendor of your GPU.

For the latest family of NVIDIA GPUs under NVIDIA AI Enterprise, start with NVIDIA AI Enterprise documentation.

You can automate the configuration of drivers by adding a custom post-install script to the BareMetalHostProfile object of your MOSK cluster. See Configure GPU virtualization for details.
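As an illustration, a post-install script attached to the BareMetalHostProfile could look like the sketch below. The driver file name, path, and module name are assumptions; obtain the actual host driver package from your GPU vendor:

```shell
#!/bin/bash
# Hypothetical post-install step for a BareMetalHostProfile: install the
# NVIDIA vGPU host driver. The driver path below is a placeholder; the
# real package comes from your NVIDIA enterprise account.
set -eu

DRIVER=/opt/drivers/NVIDIA-Linux-x86_64-vgpu-kvm.run   # hypothetical path
sh "$DRIVER" --silent

# Load the vGPU VFIO module so that vGPU types appear under sysfs.
modprobe nvidia-vgpu-vfio
```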

NVIDIA GPU virtualization modes

Certain NVIDIA GPUs, for example, those of the Ampere architecture and later, support GPU virtualization in two modes: time-sliced (vGPU) or Multi-Instance GPU (MIG). Older architectures support only the time-sliced mode.

The distinction between these modes lies in resource isolation, dedicated performance levels, and partitioning flexibility.

Typically, there is no fixed rule dictating which mode should be used: the choice depends on the intended workloads for the virtual GPUs and on the level of experience and assurance the cloud operator aims to offer users. Below is a brief overview of the differences between the two modes.

Time-sliced vGPUs

In time-sliced vGPU mode, each virtual GPU is allocated dedicated slices of the physical GPU memory while sharing the physical GPU engines. Only one vGPU operates at a time, with full access to all physical GPU engines. The resource scheduler within the physical GPU regulates the timing of each vGPU execution, ensuring fair allocation of resources.

Therefore, this setup may suffer from the noisy neighbor effect, where the performance of one vGPU is degraded by resource contention from others. On the other hand, when not all available vGPU slots are occupied, the active ones can fully utilize the power of their physical GPU.
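The scheduling behavior described above can be sketched as a toy round-robin model (illustrative only; real vGPU schedulers are considerably more sophisticated and support several scheduling policies):

```python
from collections import defaultdict

def time_sliced_schedule(vgpus, slices):
    """Toy round-robin scheduler: during each time slice exactly one vGPU
    runs and gets 100% of the physical GPU engines."""
    runtime = defaultdict(int)
    for t in range(slices):
        active = vgpus[t % len(vgpus)]  # only one vGPU executes at a time
        runtime[active] += 1
    return dict(runtime)

# Four fully populated vGPU slots share the GPU equally, 25 slices each...
print(time_sliced_schedule(["vgpu0", "vgpu1", "vgpu2", "vgpu3"], 100))
# ...but a single active vGPU receives every slice, i.e. the whole GPU.
print(time_sliced_schedule(["vgpu0"], 100))
```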


Pros:

  • Potential ability to fully utilize the compute power of the physical GPU, even if not all possible vGPUs have yet been created on that physical GPU.

  • Easier configuration.


Cons:

  • Only a single vGPU type (size of the vGPU) can be created on any given physical GPU. The cloud operator must decide beforehand which type of vGPU each physical GPU will provide.

  • Less strict resource isolation: noisy neighbors and an unpredictable level of performance for each guest vGPU.

Multi-Instance GPUs

In Multi-Instance GPU (MIG) mode, each virtual GPU is allocated dedicated physical GPU engines that are exclusively utilized by that specific virtual GPU. Virtual GPUs run in parallel, each on its own engines according to its type.
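As an illustration of this fixed-size partitioning, the sketch below checks whether a mix of MIG profiles fits on one GPU with seven compute slices. The profile names and slice counts match NVIDIA's published A100 profiles, but treat them as assumptions here; real MIG placement has additional constraints beyond the simple slice sum:

```python
# Compute slices consumed by common MIG profile names (per NVIDIA A100
# documentation; illustrative only).
SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "4g.20gb": 4, "7g.40gb": 7}
TOTAL_SLICES = 7  # an A100 exposes seven compute slices

def fits(profiles):
    """Return True if the requested MIG instances fit on one physical GPU."""
    return sum(SLICES[p] for p in profiles) <= TOTAL_SLICES

print(fits(["3g.20gb", "2g.10gb", "2g.10gb"]))  # True: 3 + 2 + 2 = 7
print(fits(["4g.20gb", "4g.20gb"]))             # False: 4 + 4 > 7
```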


Pros:

  • Ability to partition a single physical GPU into various types of virtual GPUs. This approach provides cloud operators with enhanced flexibility in determining the available vGPU types for cloud users. However, the cloud operator still has to decide beforehand what types of virtual GPU each physical GPU will provide and partition each GPU accordingly.

  • Better resource isolation and guaranteed resource access with predictable performance levels for every virtual GPU.


Cons:

  • Under-utilization of the physical GPU when not all possible virtual GPU slots are occupied.

  • Comparatively complicated configuration, especially in heterogeneous hardware environments.

Known limitations


Note: Some of these restrictions may be lifted in future releases of MOSK.

Cloud users will face the following limitations when working with GPU virtualization in MOSK:

  • Inability to create several instances with virtual GPUs in one request if there is no physical GPU available that can fit all of them at once. For NVIDIA MIG, this effectively means that you cannot create several instances with virtual GPUs in one request.

  • Inability to create an instance with several virtual GPUs.

  • Inability to attach virtual GPU to or detach virtual GPU from a running instance.

  • Inability to live-migrate instances with virtual GPU attached.

Cloud operators will face the following limitations when configuring GPU virtualization in MOSK:

  • Partitioning of physical GPUs into virtual GPUs is static, not on-demand. You need to decide beforehand what types of virtual GPUs each physical GPU will be partitioned into. Changing the partitioning requires removing all instances that use virtual GPUs from the compute node before initiating the repartitioning process.

  • Repartitioning may require additional manual steps to eliminate orphan resource providers in the placement service, and thus, avoid resource over-reporting and instance scheduling problems.

  • Configuration of multiple virtual GPU types per node may be very verbose, since the configuration depends on the particular PCI addresses of the physical GPUs on each node.
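The orphan resource provider cleanup mentioned above can be sketched with the OpenStack client (requires the osc-placement CLI plugin); the provider UUID below is a placeholder, and which providers are actually orphaned depends on your deployment:

```shell
# After repartitioning, list resource providers known to the placement
# service and look for entries that no longer match an existing
# vGPU/MIG device on the compute node.
openstack resource provider list

# Delete a provider that has become orphaned (UUID is a placeholder).
openstack resource provider delete <orphan-provider-uuid>
```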