Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.29.0 including the Cluster releases 17.4.0 and 16.4.0. For the list of MOSK known issues, see MOSK release notes 25.1: Known issues.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.

Bare metal

[50287] BareMetalHost with a Redfish BMC address is stuck on registering phase

During addition of a bare metal host that contains a Redfish Baseboard Management Controller (BMC) address with the following example configuration, the host may get stuck in the registering phase:

bmc:
  address: redfish://192.168.1.150/redfish/v1/Systems/1

Workaround:

  1. Open the ironic-config configmap for editing:

    KUBECONFIG=mgmt_kubeconfig kubectl -n kaas edit cm ironic-config
    
  2. In the data:ironic.conf section, add the enabled_firmware_interfaces parameter:

    data:
      ironic.conf: |
    
        [DEFAULT]
        ...
        enabled_firmware_interfaces = redfish,no-firmware
        ...
    
  3. Restart Ironic:

    KUBECONFIG=mgmt_kubeconfig kubectl -n kaas rollout restart deployment/ironic
    

[42386] A load balancer service does not obtain the external IP address

Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that has been changed remains with no external IP address assigned. However, the second service, which was changed later, has the external IP address assigned as expected.

To work around the issue, make a dummy change to the service object where the external IP is <pending>:

  1. Identify the service that is stuck:

    kubectl get svc -A | grep pending
    

    Example of system response:

    stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP
    
  2. Add an arbitrary label to the service that is stuck. For example:

    kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
    

    Example of system response:

    service/iam-proxy-prometheus labeled
    
  3. Verify that the external IP was allocated to the service:

    kubectl get svc -n stacklight iam-proxy-prometheus
    

    Example of system response:

    NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
    iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
    

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
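For example, where <nodeName> is the name of the node that runs the ironic Pod:

kubectl cordon <nodeName>
# Depending on the workloads running on the node, kubectl drain may also require
# the standard --ignore-daemonsets and --delete-emptydir-data flags.
kubectl drain <nodeName>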

Ceph

[50637] Ceph creates second miracephnodedisable object during node disabling

During a managed cluster update, if a node is being disabled and ceph-maintenance-controller is restarted at the same time, a second miracephnodedisable object is erroneously created for the node. As a result, the second object fails in the Cleaning state, which blocks the managed cluster update.

Workaround:

  1. On the affected managed cluster, obtain the list of miracephnodedisable objects:

    kubectl get miracephnodedisable -n ceph-lcm-mirantis
    

    The system response must contain one completed and one failed miracephnodedisable object for the node being disabled. For example:

    NAME                                               AGE   NODE NAME                                        STATE      LAST CHECK             ISSUE
    nodedisable-353ccad2-8f19-4c11-95c9-a783abb531ba   58m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Ready      2025-03-06T22:04:48Z
    nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef   57m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Cleaning   2025-03-07T11:59:27Z   host clean up Job 'ceph-lcm-mirantis/host-cleanup-nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef' is failed, check logs
    
  2. Remove the failed miracephnodedisable object. For example:

    kubectl delete miracephnodedisable -n ceph-lcm-mirantis nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef
    

[50566] Ceph upgrade is very slow during patch or major cluster update

Due to the upstream Ceph issue 66717, during a CVE-related upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start slowly and even fail the startup probe, producing the following describe output in the rook-ceph-osd-X pod:

 Warning  Unhealthy  57s (x16 over 3m27s)  kubelet  Startup probe failed:
 ceph daemon health check failed with the following output:
> no valid command found; 10 closest matches:
> 0
> 1
> 2
> abort
> assert
> bluefs debug_inject_read_zeros
> bluefs files list
> bluefs stats
> bluestore bluefs device info [<alloc_size:int>]
> config diff
> admin_socket: invalid command

Workaround:

Complete the following steps during every patch or major cluster update of the Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes supported):

  1. Plan extra time in the maintenance window for the patch cluster update.

    Slow starts will still affect the update procedure, but after you complete the following step, the recovery process becomes noticeably shorter without affecting the overall cluster state and data responsiveness.

  2. Select one of the following options:

    • Before the cluster update, set the noout flag:

      ceph osd set noout
      

      Once the Ceph OSDs image upgrade is done, unset the flag:

      ceph osd unset noout
      
    • Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear, set the noout flag as soon as possible. Once the Ceph OSDs image upgrade is done, unset the flag.
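      One possible way to monitor the Ceph OSDs image upgrade is sketched below; it assumes the standard rook-ceph namespace and the rook-ceph-tools deployment:

      # Watch the rook-ceph-osd pods being recreated on the new image
      kubectl -n rook-ceph get pods -l app=rook-ceph-osd -w

      # Check which Ceph versions the OSD daemons currently run
      kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions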

[26441] Cluster update fails with the MountDevice failed for volume warning

Update of a managed cluster based on bare metal with Ceph enabled fails with a PersistentVolumeClaim stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

  1. Verify that the descriptions of the Pods that failed to run contain the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    

    In the command above, replace the following values:

    • <affectedProjectName> is the Container Cloud project name where the Pods failed to run

    • <affectedPodName> is a Pod name that failed to run in the specified project

    In the Pod description, identify the node name where the Pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas. For an example of the scaling commands, see the sketch after this procedure.

  4. On every csi-rbdplugin Pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected Pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
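
A minimal sketch of the scaling in steps 3 and 7, assuming that the affected workload is the prometheus-server StatefulSet in the stacklight namespace with 1 replica originally (adjust the resource type, names, and replica count to your environment):

# Step 3: scale the affected workload down to 0 replicas
kubectl -n stacklight scale statefulset prometheus-server --replicas=0

# Step 7: scale it back up to the original number of replicas
kubectl -n stacklight scale statefulset prometheus-server --replicas=1

# Verify that the Pod is Running again
kubectl -n stacklight get pods | grep prometheus-server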

LCM

[50768] Failure to update the MCCUpgrade object

While editing the MCCUpgrade object, the following error occurs when trying to save changes:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"Internal error occurred: failed calling webhook \"mccupgrades.kaas.mirantis.com\":
failed to call webhook: the server could not find the requested resource",
"reason":"InternalError",
"details":{"causes":[{"message":"failed calling webhook \"mccupgrades.kaas.mirantis.com\":
failed to call webhook: the server could not find the requested resource"}]},"code":500}

To work around the issue, remove the name: mccupgrades.kaas.mirantis.com entry from the admission-controller mutatingwebhookconfiguration:

kubectl --kubeconfig kubeconfig edit mutatingwebhookconfiguration admission-controller

Example of the webhook entry to remove:

- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: <REDACTED>
    service:
      name: admission-controller
      namespace: kaas
      path: /mccupgrades
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: mccupgrades.kaas.mirantis.com
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - kaas.mirantis.com
    apiVersions:
    - v1alpha1
    operations:
    - CREATE
    - UPDATE
    resources:
    - mccupgrades
    scope: '*'
  sideEffects: NoneOnDryRun
  timeoutSeconds: 5

[50561] The local-volume-provisioner pod switches to CrashLoopBackOff

After a machine is disabled and subsequently re-enabled, persistent volumes (PVs) provisioned by local-volume-provisioner that are not used by any pod may cause the local-volume-provisioner pod on that machine to switch to the CrashLoopBackOff state.

Workaround:

  1. Identify the name of the affected local-volume-provisioner pod:

    kubectl -n kube-system get pods
    

    Example of system response extract:

    local-volume-provisioner-h5lrc   0/1   CrashLoopBackOff   33 (2m3s ago)   90m
    
  2. In the local-volume-provisioner logs, identify the affected PVs. For example:

    kubectl logs -n kube-system local-volume-provisioner-h5lrc | less
    

    Example of system response extract:

    E0304 23:21:31.455148    1 discovery.go:221] Failed to discover local volumes:
    5 error(s) while discovering volumes: [error creating PV "local-pv-1d04ed53"
    for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol04":
    persistentvolumes "local-pv-1d04ed53" already exists error creating PV "local-pv-ce2dfc24"
    for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol01":
    persistentvolumes "local-pv-ce2dfc24" already exists error creating PV "local-pv-bcb9e4bd"
    for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol02":
    persistentvolumes "local-pv-bcb9e4bd" already exists error creating PV "local-pv-c5924ada"
    for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol03":
    persistentvolumes "local-pv-c5924ada" already exists error creating PV "local-pv-7c7150cf"
    for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol00":
    persistentvolumes "local-pv-7c7150cf" already exists]
    
  3. Delete all PVs that contain the already exists error in logs. For example:

    kubectl delete pv local-pv-1d04ed53
    

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
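
A minimal sketch of these steps using kubectl, assuming that the affected Pod is mariadb-server-0 in the kaas namespace of the management cluster (adjust the Pod name and namespace to your environment, and add -c <containerName> if the Pod runs more than one container):

# Step 1: back up /var/lib/mysql from the affected Pod to the local machine
kubectl -n kaas cp mariadb-server-0:/var/lib/mysql ./mysql-backup

# Step 2: verify that the other mariadb-server replicas are up and ready
kubectl -n kaas get pods | grep mariadb-server

# Step 3: remove the galera.cache file on the affected Pod
kubectl -n kaas exec mariadb-server-0 -- rm /var/lib/mysql/galera.cache

# Step 4: remove the affected Pod so that Kubernetes recreates it
kubectl -n kaas delete pod mariadb-server-0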

StackLight

[43474] Custom Grafana dashboards are corrupted

Custom Grafana panels and dashboards may be corrupted after automatic migration of deprecated Angular-based plugins to the React-based ones. For details, see MOSK Deprecation Notes: Angular plugins in Grafana dashboards and the post-update step Back up custom Grafana dashboards in Container Cloud 2.28.4 update notes.

To work around the issue, manually adjust the affected dashboards to restore their custom appearance.

Container Cloud web UI

[50181] Failure to deploy a compact cluster

A compact MOSK cluster fails to be deployed through the Container Cloud web UI because the web UI does not allow adding any label to the control plane machines or changing dedicatedControlPlane: false.

To work around the issue, manually add the required labels using the CLI, as illustrated below. Once done, the cluster deployment resumes.
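
The following is a sketch only; the kubeconfig path, project name, and machine name are placeholders, and the exact labels to add depend on your deployment, so take the required label keys and values from the MOSK documentation:

# Open each control plane Machine object for editing and add the required labels
kubectl --kubeconfig <mgmtKubeconfig> -n <projectName> edit machine <controlPlaneMachineName>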

[50168] Inability to use a new project right after creation

A newly created project does not display all available tabs in the Container Cloud web UI and shows various access denied errors during the first five minutes after creation.

To work around the issue, refresh the browser five minutes after the project creation.

[50140] The Ceph Clusters tab does not display Ceph cluster details

The Clusters page for the bare metal provider does not display information about the Ceph cluster in the Ceph Clusters tab and contains access denied errors.

To work around the issue, verify the Ceph cluster state through the CLI. For details, see MOSK documentation: Ceph operations - Verify Ceph.
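
For example, a quick health check from the CLI may look as follows. This is a sketch: the rook-ceph-tools deployment on the managed cluster and the KaaSCephCluster object on the management cluster are assumptions based on a typical Container Cloud Ceph setup:

# On the managed cluster: overall Ceph health and status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s

# On the management cluster: review the Ceph cluster object for the project
kubectl -n <projectName> get kaascephcluster -o yaml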