Troubleshooting

Steps to Debug KubeVirt Cluster Deployments

Check the ClusterDeployment status condition on the management or regional cluster:

kubectl -n $CLUSTER_NAMESPACE get clusterdeployment.k0rdent.mirantis.com $CLUSTER_NAME -o=jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq

Check the KubevirtCluster status condition on the management or regional cluster:

kubectl -n $CLUSTER_NAMESPACE get kubevirtcluster $CLUSTER_NAME -o=jsonpath='{.status.conditions[?(@.type=="Ready")]}' | jq

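Check the VirtualMachine (VM) and VirtualMachineInstance (VMI) status on the KubeVirt Infrastructure Cluster. The following command shows the VirtualMachine objects: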

kubectl --kubeconfig $KUBEVIRT_INFRA_KUBECONFIG_PATH -n $CLUSTER_NAMESPACE get vm -l cluster.x-k8s.io/cluster-name=$CLUSTER_NAME -o=yaml

List the virt-handler pods on the KubeVirt Infrastructure Cluster, then inspect their logs:

kubectl --kubeconfig $KUBEVIRT_INFRA_KUBECONFIG_PATH -n kubevirt get pods -l kubevirt.io=virt-handler
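
The command above only lists the pods. A minimal sketch for printing their recent log lines could look as follows; the function name `virt_handler_logs` and the default of 100 lines are illustrative assumptions, not something this product ships:

```shell
# Print recent log lines from every virt-handler pod on the infrastructure cluster.
# virt_handler_logs is a hypothetical helper; the tail length defaults to 100 lines.
virt_handler_logs() {
  local tail_lines="${1:-100}"
  kubectl --kubeconfig "$KUBEVIRT_INFRA_KUBECONFIG_PATH" -n kubevirt logs \
    -l kubevirt.io=virt-handler --all-containers --prefix --tail="$tail_lines"
}
```

Call it as `virt_handler_logs` or, for a longer history, `virt_handler_logs 500`. The `--prefix` flag labels each line with its source pod, which helps when several nodes run virt-handler.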

Sometimes you need to log in to the VM created on the KubeVirt Infrastructure Cluster to check system or k0s logs. You can open the serial console of the VM:
```
virtctl console -n $CLUSTER_NAMESPACE $VM_NAME --kubeconfig $KUBEVIRT_INFRA_KUBECONFIG_PATH
```
Alternatively, forward a local port to the SSH port of the VM:
```
virtctl port-forward vmi/$VM_NAME -n $CLUSTER_NAMESPACE --kubeconfig $KUBEVIRT_INFRA_KUBECONFIG_PATH $LOCAL_PORT:22
```
Then you can SSH into the VM:
```
ssh -p $LOCAL_PORT capk@127.0.0.1 -i $SSH_PRIVATE_KEY_PATH
```
Warning

The SSH key pair is generated by Cluster API Provider KubeVirt during the provisioning process. You can retrieve the private key from the secret created in the management or regional cluster:
```
kubectl get secret -n $CLUSTER_NAMESPACE ${CLUSTER_NAME}-ssh-keys -o=jsonpath='{.data.key}' | base64 -d
```
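
Because ssh rejects private keys that other users can read, it can help to write the decoded key straight into a file with restrictive permissions. The following is a sketch under that assumption; `save_cluster_ssh_key` is a hypothetical helper name and the output path is up to you:

```shell
# Fetch the generated private key and store it with permissions that ssh accepts.
# save_cluster_ssh_key is a hypothetical helper; the output path is the caller's choice.
save_cluster_ssh_key() {
  local out_file="$1"
  (
    # Create the file readable by the owner only.
    umask 077
    kubectl get secret -n "$CLUSTER_NAMESPACE" "${CLUSTER_NAME}-ssh-keys" \
      -o=jsonpath='{.data.key}' | base64 -d > "$out_file"
  )
  # Tighten permissions in case the file already existed.
  chmod 600 "$out_file"
}
```

For example, `save_cluster_ssh_key ./capk_id` followed by `export SSH_PRIVATE_KEY_PATH=./capk_id` prepares the variable used in the ssh command above.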
Check logs on the VM:

For k0s logs:
```
sudo journalctl -u k0sworker
```
For container logs, see the /var/log/containers directory. For more information, see k0s troubleshooting.

Known Issues

The KubeVirtCluster deployment fails on `proto: integer overflow` error

Related issue #369

When deploying the KubeVirt cluster, if the namespace where the ClusterDeployment has been created does not exist on the KubeVirt Infrastructure Cluster, the following misleading error may appear in the cluster-api-provider-kubevirt logs:

E0126 13:41:19.622971       1 controller.go:474] "Reconciler error" err="failed to create load balancer: proto: integer overflow"
controller="kubevirtcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="KubevirtCluster"
KubevirtCluster="kcm-system/my-kubevirt-clusterdeployment1" namespace="kcm-system" name="my-kubevirt-clusterdeployment1"
reconcileID="2074dadd-23e8-4d64-bf87-b7da2905f347"

The ClusterDeployment Ready condition will be:

kubectl -n kcm-system get clusterdeployment my-kubevirt-clusterdeployment1 -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

{
  "lastTransitionTime": "2026-01-26T13:37:06Z",
  "message": "* InfrastructureReady:
    * LoadBalancerAvailable: proto: integer overflow (0/1 conditions met)
    * ControlPlaneInitialized: Control plane not yet initialized
    * ControlPlaneAvailable: K0sControlPlane status.initialization.controlPlaneInitialized is false
    * WorkersAvailable:
      * MachineDeployment my-kubevirt-clusterdeployment1-md: 0 available replicas, at least 1 required (spec.strategy.rollout.maxUnavailable is 0, spec.replicas is 1)
      * RemoteConnectionProbe: Remote connection not established yet",
  "reason": "Failed",
  "status": "False",
  "type": "Ready"
}

Workaround

The most likely cause is that the namespace of the ClusterDeployment does not exist on the KubeVirt Infrastructure Cluster. Create that namespace on the KubeVirt Infrastructure Cluster before creating the ClusterDeployment object:

kubectl --kubeconfig <kubevirt-infra-cluster-kubeconfig> create namespace <cld-namespace>
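
If the namespace may already exist (for example, when the step runs from automation), a guarded variant avoids an AlreadyExists error. This is a minimal sketch; `ensure_infra_namespace` is a hypothetical helper name, and both arguments are supplied by the caller:

```shell
# Create the namespace on the KubeVirt Infrastructure Cluster only if it is missing.
# ensure_infra_namespace is a hypothetical helper: $1 = kubeconfig path, $2 = namespace.
ensure_infra_namespace() {
  local kubeconfig="$1" namespace="$2"
  if kubectl --kubeconfig "$kubeconfig" get namespace "$namespace" >/dev/null 2>&1; then
    echo "namespace $namespace already exists"
  else
    kubectl --kubeconfig "$kubeconfig" create namespace "$namespace"
  fi
}
```

For example: `ensure_infra_namespace "$KUBEVIRT_INFRA_KUBECONFIG_PATH" "$CLUSTER_NAMESPACE"`.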

bridge-marker CrashLoopBackOff when Ceph RADOS Gateway uses port 8081

On bare-metal worker nodes where Mirantis k0rdent Virtualization and Ceph are deployed together, the CNAO-managed bridge-marker DaemonSet may fail to start. The bridge-marker pod runs with hostNetwork: true and binds its health HTTP server to port 8081. If Ceph RADOS Gateway (RGW) is also listening on port 8081 on the same node, bridge-marker cannot bind to the port and exits with "address already in use". The container terminates with exit code 2, probes never succeed, and the pods remain in CrashLoopBackOff:

kubectl -n kubevirt get pods -l app=bridge-marker
NAME                  READY   STATUS             RESTARTS   AGE
bridge-marker-6pcbm   0/1     CrashLoopBackOff   262        21h
bridge-marker-ck89b   0/1     CrashLoopBackOff   261        21h
bridge-marker-fmpgf   0/1     CrashLoopBackOff   261        21h

The bridge-marker container exposes port 8081 on the host:

kubectl -n kubevirt describe pod -l app=bridge-marker

Containers:
  bridge-marker:
    Port:          8081/TCP
    Host Port:     8081/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      Error
      Exit Code:   2
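
To confirm which process holds the port, you can inspect the listening sockets directly on an affected node. The sketch below assumes that the `ss` utility from iproute2 is available on the node; `show_tcp_listener` is a hypothetical helper name:

```shell
# Show which process is listening on a TCP port; run this on the affected node.
# show_tcp_listener is a hypothetical helper; the port defaults to 8081.
show_tcp_listener() {
  local port="${1:-8081}"
  # -l listening sockets, -t TCP, -n numeric ports, -p owning process (needs root)
  ss -ltnp "sport = :${port}"
}
```

Run `sudo ss -ltnp 'sport = :8081'` directly, or call `show_tcp_listener` from a root shell. If the owning process is a Ceph RGW daemon (typically `radosgw`), the node is affected by this issue.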

Note

Starting with the HCO 1.18.4-mira release, this issue can be resolved on the HCO side by relocating the bridge-marker health prober port. Set spec.networking.linuxBridgeConfig.bridgeMarker.healthProberPort in the HyperConverged (HCO) CR to a free port so that it no longer collides with Ceph RADOS Gateway:

spec:
  networking:
    linuxBridgeConfig:
      bridgeMarker:
        healthProberPort: 9090

On earlier HCO releases, use the Ceph RADOS Gateway port workaround below instead.

Workaround

Configure the Ceph RADOS Gateway to use port 8082 instead of 8081 in the CephDeployment object. Edit the CephDeployment in the ceph-lcm-mirantis namespace and set spec.objectStorage.rgw.gateway.port to 8082:

kubectl edit cephdeployment -n ceph-lcm-mirantis

apiVersion: lcm.mirantis.com/v1alpha1
kind: CephDeployment
metadata:
  name: rook-ceph
  namespace: ceph-lcm-mirantis
spec:
  objectStorage:
    rgw:
      gateway:
        port: 8082
        securePort: 8443

After the Ceph controller applies the updated configuration, verify that RGW is listening on port 8082 and that bridge-marker pods become Ready:

kubectl -n kubevirt get pods -l app=bridge-marker
kubectl get cephdeployment -n ceph-lcm-mirantis

KubeVirt VMs fail to start on k0s with Calico 3.32+ when VM IP persistence is enabled

On k0s clusters running Calico v3.32+ with kubeVirtVMAddressPersistence: Enabled, virt-launcher pods fail during network setup. Calico CNI tries to read the backing VirtualMachineInstance, but the calico-cni-plugin service account lacks RBAC permission. VMs remain stuck in Starting / Scheduling:

plugin type="calico" failed (add): failed to get VMI info:
failed to query VMI resource default/vm:
virtualmachineinstances.kubevirt.io "vm" is forbidden:
User "system:serviceaccount:kube-system:calico-cni-plugin" cannot get resource
"virtualmachineinstances" in API group "kubevirt.io" in the namespace "default"

Calico 3.32 enables KubeVirt VM IP persistence (kubeVirtVMAddressPersistence: Enabled). The CNI plugin verifies virt-launcher pods by reading the corresponding VMI via ownerReferences. The k0s-managed calico-cni-plugin ClusterRole only grants core Calico permissions and is missing KubeVirt rules that Calico 3.32 expects:

- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstances", "virtualmachines"]
  verbs: ["get"]

For more information, see the Calico KubeVirt networking documentation.

Workaround

Patch the calico-cni-plugin ClusterRole to grant the required KubeVirt permissions:

kubectl patch clusterrole calico-cni-plugin --type='json' -p='[
  {"op":"add","path":"/rules/-","value":{
    "apiGroups":["kubevirt.io"],
    "resources":["virtualmachineinstances","virtualmachines"],
    "verbs":["get"]
  }}
]'

After applying the patch, delete any stuck virt-launcher pods so they are recreated with working network setup.
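
The verification and cleanup can be sketched as two small helpers. Both function names are hypothetical. The sketch assumes that your user may impersonate service accounts (cluster-admin can), that virt-launcher pods carry the `kubevirt.io=virt-launcher` label, and that stuck pods are still in the Pending phase:

```shell
# Ask the API server whether the Calico CNI service account may read VMIs.
# Requires permission to impersonate service accounts (for example, cluster-admin).
calico_can_get_vmi() {
  local namespace="${1:-default}"
  kubectl auth can-i get virtualmachineinstances.kubevirt.io \
    --as=system:serviceaccount:kube-system:calico-cni-plugin -n "$namespace"
}

# Delete virt-launcher pods that are still Pending in the given namespace.
# Assumption: the pods carry the kubevirt.io=virt-launcher label.
delete_pending_virt_launchers() {
  local namespace="${1:-default}"
  kubectl -n "$namespace" delete pod -l kubevirt.io=virt-launcher \
    --field-selector=status.phase=Pending
}
```

`calico_can_get_vmi default` should print `yes` once the ClusterRole patch is in place. After that, `delete_pending_virt_launchers default` removes the stuck pods so that KubeVirt recreates them.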

Sidecar and Root feature gates are not converted from `v1beta1` to `v1` in HCO 1.18.4-mira

Starting with HCO 1.18.4-mira, the HyperConverged API moves from hco.kubevirt.io/v1beta1 to hco.kubevirt.io/v1. The conversion webhook migrates most fields automatically, but custom Sidecar and Root feature gate settings from v1beta1 are not preserved.

In v1beta1, these settings were configured as boolean fields under spec.featureGates (for example, sidecar: true or root: true). In v1, feature gates use a list of objects with name and optional state fields. After upgrade, clusters that previously enabled or disabled sidecar or root behavior may revert to defaults unless the settings are reapplied manually.

Before upgrading, record the existing v1beta1 feature gate values:

kubectl -n kubevirt get hyperconverged kubevirt-hyperconverged -o jsonpath='{.spec.featureGates}{"\n"}' | jq
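
To keep the recorded values available after the upgrade, you can write them to a file instead of only printing them. A minimal sketch, where `backup_hco_feature_gates` is a hypothetical helper name and the file path is your choice:

```shell
# Save the current HyperConverged feature gates to a file for later comparison.
# backup_hco_feature_gates is a hypothetical helper; $1 = output file path.
backup_hco_feature_gates() {
  local out_file="$1"
  kubectl -n kubevirt get hyperconverged kubevirt-hyperconverged \
    -o jsonpath='{.spec.featureGates}{"\n"}' > "$out_file"
}
```

For example, run `backup_hco_feature_gates featuregates-before.json` before the upgrade and `backup_hco_feature_gates featuregates-after.json` afterwards. The two files use different layouts (boolean fields in v1beta1, a list of objects in v1), so compare them by reading rather than with a line-by-line diff.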

After upgrading to 1.18.4-mira, verify the converted HyperConverged CR:

kubectl -n kubevirt get hyperconverged kubevirt-hyperconverged -o jsonpath='{.spec.featureGates}' | jq

Workaround

If sidecar or root settings were lost during conversion, reapply them in the v1 HyperConverged CR:

apiVersion: hco.kubevirt.io/v1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: kubevirt
spec:
  featureGates:
    - name: downwardMetrics
    - name: sidecar
    - name: root
  virtualization:
    platform: k0s
  deployment:
    nodePlacements:
      infra:
        nodeSelector:
          kubernetes.io/os: linux

Only include sidecar and root entries if they were previously enabled in your environment. For more information on the v1 HyperConverged API format, see Configuration.
