GPU support for Kubernetes workloads¶
MKE provides graphics processing unit (GPU) support for Kubernetes workloads that run on Linux worker nodes. This topic describes how to configure your system to use and deploy NVIDIA GPUs.
Install the GPU drivers¶
GPU support requires that you install the GPU drivers, which you can do either before or after installing MKE. The procedure below installs the NVIDIA driver on your Linux host using a runfile.
Note
This procedure describes how to manually install the GPU drivers. However, Mirantis recommends that you use a pre-existing automation system to automate the installation and patching of the drivers, along with the kernel and other host software.
Enable the NVIDIA GPU device plugin by setting nvidia_device_plugin to true in the MKE configuration file (see the configuration sketch after the following verification command).

Verify that your system has an NVIDIA GPU:

lspci | grep -i nvidia
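The device plugin setting belongs in the MKE configuration file (TOML). The following is a minimal sketch of one way to apply it through the MKE config-toml API; the MKE_HOST and AUTHTOKEN placeholders and the placement of the option under [cluster_config] are assumptions here, so adapt the sketch to however you normally manage the MKE configuration.

# Sketch only: download the current MKE configuration, enable the NVIDIA
# device plugin, and upload the modified file again. MKE_HOST and AUTHTOKEN
# are placeholders for your MKE address and admin session token.
curl --silent --insecure -H "Authorization: Bearer ${AUTHTOKEN}" \
    "https://${MKE_HOST}/api/ucp/config-toml" -o mke-config.toml

# In mke-config.toml, under [cluster_config], set:
#   nvidia_device_plugin = true

curl --silent --insecure -X PUT -H "Authorization: Bearer ${AUTHTOKEN}" \
    --upload-file mke-config.toml "https://${MKE_HOST}/api/ucp/config-toml"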
Verify that your GPU is a supported NVIDIA GPU Product.
Install all the dependencies listed in the NVIDIA Minimum Requirements.
Verify that your system is up to date and that you are running the latest kernel version.
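A minimal sketch of this check on an Ubuntu host follows; use the equivalent yum or dnf commands on RHEL.

# Check the running kernel version and apply any pending updates.
uname -r
sudo apt-get update && sudo apt-get upgrade -y

# If the upgrade installed a new kernel, reboot so that uname -r matches
# the linux-headers package that you install in the next step.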
Install the following packages:
Ubuntu:
sudo apt-get install -y gcc make curl linux-headers-$(uname -r)
RHEL:
sudo yum install -y kernel-devel-$(uname -r) \
    kernel-headers-$(uname -r) gcc make curl elfutils-libelf-devel
Verify that the i2c_core and ipmi_msghandler kernel modules are loaded:

sudo modprobe -a i2c_core ipmi_msghandler
Persist the change across reboots:
echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/nvidia.conf
Create the directory on the host in which the NVIDIA OpenGL libraries will reside, and register it with the dynamic linker:

NVIDIA_OPENGL_PREFIX=/opt/kubernetes/nvidia
sudo mkdir -p $NVIDIA_OPENGL_PREFIX/lib
echo "${NVIDIA_OPENGL_PREFIX}/lib" | sudo tee /etc/ld.so.conf.d/nvidia.conf
sudo ldconfig
Install the NVIDIA GPU driver:
NVIDIA_DRIVER_VERSION=<version-number>
curl -LSf https://us.download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run -o nvidia.run
sudo sh nvidia.run --opengl-prefix="${NVIDIA_OPENGL_PREFIX}"

Set <version-number> to the NVIDIA driver version of your choice.

Load the NVIDIA Unified Memory kernel module and create device files for the module on startup:
sudo tee /etc/systemd/system/nvidia-modprobe.service << END
[Unit]
Description=NVIDIA modprobe

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-modprobe -c0 -u

[Install]
WantedBy=multi-user.target
END

sudo systemctl enable nvidia-modprobe
sudo systemctl start nvidia-modprobe
Enable the NVIDIA persistence daemon to initialize GPUs and keep them initialized:
sudo tee /etc/systemd/system/nvidia-persistenced.service << END
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
END

sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
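At this point you can optionally confirm that the driver and the persistence daemon are working. The nvidia-smi utility is installed by the driver runfile and should list your GPU:

# Query the driver for the detected GPUs and check the daemon status.
nvidia-smi
systemctl status nvidia-persistenced --no-pager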
Test the device plugin by reviewing the node description:
kubectl describe node <node-name>
Example output:
Capacity:
  cpu:                8
  ephemeral-storage:  40593612Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             62872884Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                7750m
  ephemeral-storage:  36399308Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             60775732Ki
  nvidia.com/gpu:     1
  pods:               110
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests    Limits
  --------        --------    ------
  cpu             500m (6%)   200m (2%)
  memory          150Mi (0%)  440Mi (0%)
  nvidia.com/gpu  0           0
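To summarize the GPU capacity of every node at once instead of describing each node individually, you can reuse the same jsonpath expression that appears later in this topic; a minimal sketch:

# Print each node name followed by its allocatable nvidia.com/gpu count.
kubectl get nodes -o jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.status.allocatable['nvidia\.com/gpu']}{'\n'}{end}"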
Schedule GPU workloads¶
The following example describes how to deploy a simple workload that reports detected NVIDIA CUDA devices.
Create a practice Deployment that requests nvidia.com/gpu in the limits section. The Pod will be scheduled on any node in your system that has an available GPU.

kubectl apply -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    run: gpu-test
  name: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      run: gpu-test
  template:
    metadata:
      labels:
        run: gpu-test
    spec:
      containers:
      - command:
        - sh
        - -c
        - "deviceQuery && sleep infinity"
        image: kshatrix/gpu-example:cuda-10.2
        name: gpu-test
        resources:
          limits:
            nvidia.com/gpu: "1"
EOF
Verify that it is in the Running state:

kubectl get pods | grep "gpu-test"

Example output:

NAME                        READY   STATUS    RESTARTS   AGE
gpu-test-747d746885-hpv74   1/1     Running   0          14m
Review the logs. The presence of Result = PASS indicates a successful deployment:

kubectl logs <name of the pod>
Example output:
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla V100-SXM2-16GB"
...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
Determine the overall GPU capacity of your cluster by inspecting its nodes:
echo $(kubectl get nodes -l com.docker.ucp.gpu.nvidia="true" \
    -o jsonpath="0{range .items[*]}+{.status.allocatable['nvidia\.com/gpu']}{end}") | bc
Set the proper replica number to acquire all available GPUs:
kubectl scale deployment/gpu-test --replicas N
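As a convenience, the two previous steps can be combined. The following sketch captures the total GPU capacity in a shell variable and scales the Deployment to that number:

# Compute the cluster-wide GPU capacity (same expression as above) and
# scale the example Deployment to consume all of it.
GPUS=$(echo $(kubectl get nodes -l com.docker.ucp.gpu.nvidia="true" \
    -o jsonpath="0{range .items[*]}+{.status.allocatable['nvidia\.com/gpu']}{end}") | bc)
kubectl scale deployment/gpu-test --replicas "${GPUS}"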
Verify that all of the replicas are scheduled:
kubectl get pods | grep "gpu-test"
Example output:
NAME                        READY   STATUS    RESTARTS   AGE
gpu-test-747d746885-hpv74   1/1     Running   0          12m
gpu-test-747d746885-swrrx   1/1     Running   0          11m
Remove the Deployment and corresponding Pods:
kubectl delete deployment gpu-test
Troubleshooting¶
If you attempt to add an additional replica to the previous example Deployment, the new Pod cannot be scheduled because all of the GPUs in the cluster are already allocated, and the attempt results in a FailedScheduling error with the Insufficient nvidia.com/gpu message.
Add an additional replica:
kubectl scale deployment/gpu-test --replicas N+1
kubectl get pods | grep "gpu-test"
Example output:
NAME                        READY   STATUS    RESTARTS   AGE
gpu-test-747d746885-hpv74   1/1     Running   0          14m
gpu-test-747d746885-swrrx   1/1     Running   0          13m
gpu-test-747d746885-zgwfh   0/1     Pending   0          3m26s
Review the status of the Pod that failed to schedule:
kubectl describe po gpu-test-747d746885-zgwfh
Example output:
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
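To clear the Pending Pod, scale the Deployment back down to the GPU capacity you determined earlier:

# Scale back to the number of GPUs actually available in the cluster.
kubectl scale deployment/gpu-test --replicas N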