Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.29.3 including the Cluster releases 17.3.8, 16.3.8, and 16.4.3. For the known issues in the related MOSK release, see MOSK release notes 24.3.5: Known issues.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines known issues from previous Container Cloud releases that are still valid.
Bare metal¶
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that has been changed remains with no external IP address assigned, while the second service, which was changed later, has the external IP assigned as expected.
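To check whether your cluster hits this condition, you can list the external IP address and traffic policy of all services and look for pairs that share the same address. The following is a minimal inspection sketch using standard kubectl output formatting, not part of the official workaround:

kubectl get svc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXTERNAL-IP:.status.loadBalancer.ingress[0].ip,POLICY:.spec.externalTrafficPolicy'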
To work around the issue, make a dummy change to the service object where the external IP is <pending>:

1. Identify the service that is stuck:

   kubectl get svc -A | grep pending

   Example of system response:

   stacklight   iam-proxy-prometheus   LoadBalancer   10.233.28.196   <pending>   443:30430/TCP

2. Add an arbitrary label to the service that is stuck. For example:

   kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

   Example of system response:

   service/iam-proxy-prometheus labeled

3. Verify that the external IP was allocated to the service:

   kubectl get svc -n stacklight iam-proxy-prometheus

   Example of system response:

   NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
   iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
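Optionally, once the external IP address is assigned, remove the dummy label. The reconcile key below is the arbitrary example used above; adjust it if you added a different label:

kubectl label svc -n stacklight iam-proxy-prometheus reconcile-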
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

- All Pods are stuck in the Terminating state
- A new ironic Pod fails to start
- The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
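The following is a minimal sketch of the cordon and drain sequence. The node name is a placeholder, and the --ignore-daemonsets and --delete-emptydir-data flags are typical when draining nodes that run DaemonSet-managed Pods; verify them against your environment before use:

kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data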
Ceph¶
[50637] Ceph creates second miracephnodedisable object during node disabling¶
During managed cluster update, if some node is being disabled and at the same time ceph-maintenance-controller is restarted, a second miracephnodedisable object is erroneously created for the node. As a result, the second object fails in the Cleaning state, which blocks the managed cluster update.
Workaround
1. On the affected managed cluster, obtain the list of miracephnodedisable objects:

   kubectl get miracephnodedisable -n ceph-lcm-mirantis

   The system response must contain one completed and one failed miracephnodedisable object for the node being disabled. For example:

   NAME                                               AGE   NODE NAME                                        STATE      LAST CHECK             ISSUE
   nodedisable-353ccad2-8f19-4c11-95c9-a783abb531ba   58m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Ready      2025-03-06T22:04:48Z
   nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef   57m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Cleaning   2025-03-07T11:59:27Z   host clean up Job 'ceph-lcm-mirantis/host-cleanup-nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef' is failed, check logs
2. Remove the failed miracephnodedisable object. For example:

   kubectl delete miracephnodedisable -n ceph-lcm-mirantis nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef
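As an optional verification step, which is a suggestion rather than a documented requirement, list the objects again and confirm that only the completed miracephnodedisable object remains for the node being disabled:

kubectl get miracephnodedisable -n ceph-lcm-mirantis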
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with the PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.
Workaround:
1. Verify that the description of the Pods that failed to run contains the FailedMount events:

   kubectl -n <affectedProjectName> describe pod <affectedPodName>

   In the command above, replace the following values:

   - <affectedProjectName> is the Container Cloud project name where the Pods failed to run
   - <affectedPodName> is a Pod name that failed to run in the specified project

   In the Pod description, identify the node name where the Pod failed to run.

2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

   1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
        -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'

   2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin

3. Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas (see the scaling sketch after this procedure).

4. On every csi-rbdplugin Pod, search for the stuck csi-vol:

   for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
     echo $pod
     kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
   done

5. Unmap the affected csi-vol:

   rbd unmap -o force /dev/rbd<i>

   The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

6. Delete the volumeattachment of the affected Pod:

   kubectl get volumeattachments | grep <csi-vol-uuid>
   kubectl delete volumeattachment <id>

7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
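For reference, the following is a minimal sketch of steps 3 and 7 for the prometheus-server StatefulSet mentioned in this issue, assuming that it runs in the stacklight namespace with a single replica; adjust the namespace, workload name, and replica count to your environment:

kubectl -n stacklight scale statefulset prometheus-server --replicas=0
# Perform the cleanup steps above, then restore the original number of replicas:
kubectl -n stacklight scale statefulset prometheus-server --replicas=1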
LCM¶
[50561] The local-volume-provisioner pod switches to CrashLoopBackOff¶
After machine disablement and subsequent re-enablement, persistent volumes (PVs) provisioned by local-volume-provisioner may cause the local-volume-provisioner pod on that machine to switch to the CrashLoopBackOff state.

This occurs because re-enabling the machine results in a new node UID being assigned in Kubernetes. As a result, the owner ID of the volumes provisioned by local-volume-provisioner no longer matches the new node UID. Although the volumes remain in the correct system paths, local-volume-provisioner detects a mismatch in ownership, leading to an unhealthy service state.
Workaround:
1. Identify the name of the affected local-volume-provisioner pod:

   kubectl -n kube-system get pods

   Example of system response extract:

   local-volume-provisioner-h5lrc   0/1   CrashLoopBackOff   33 (2m3s ago)   90m

2. In the local-volume-provisioner logs, identify the affected PVs. For example:

   kubectl logs -n kube-system local-volume-provisioner-h5lrc | less

   Example of system response extract:

   E0304 23:21:31.455148  1 discovery.go:221] Failed to discover local volumes: 5 error(s) while discovering volumes:
   [error creating PV "local-pv-1d04ed53" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol04": persistentvolumes "local-pv-1d04ed53" already exists
   error creating PV "local-pv-ce2dfc24" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol01": persistentvolumes "local-pv-ce2dfc24" already exists
   error creating PV "local-pv-bcb9e4bd" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol02": persistentvolumes "local-pv-bcb9e4bd" already exists
   error creating PV "local-pv-c5924ada" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol03": persistentvolumes "local-pv-c5924ada" already exists
   error creating PV "local-pv-7c7150cf" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol00": persistentvolumes "local-pv-7c7150cf" already exists]

3. Update the pv.kubernetes.io/provisioned-by annotation for all PVs that are mentioned in the already exists errors on the enabled node. The annotation must have the local-volume-provisioner-<K8S-NODE-NAME>-<K8S-NODE-UID> format (see the patch sketch after this procedure).

   To obtain the node UID:

   kubectl get node <K8S-NODE-NAME> -o jsonpath='{.metadata.uid}'

   To edit annotations on the volumes:

   kubectl edit pv <PV-NAME>

   For example:

   kubectl edit pv local-pv-ce2dfc24
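As an alternative to editing each PV interactively, the following is a minimal sketch of a non-interactive patch, assuming the PV name from the example above; <K8S-NODE-NAME> is a placeholder, and you should verify the resulting annotation value before applying the same change to other PVs:

NODE_NAME=<K8S-NODE-NAME>
NODE_UID=$(kubectl get node ${NODE_NAME} -o jsonpath='{.metadata.uid}')
kubectl patch pv local-pv-ce2dfc24 --type merge \
  -p "{\"metadata\":{\"annotations\":{\"pv.kubernetes.io/provisioned-by\":\"local-volume-provisioner-${NODE_NAME}-${NODE_UID}\"}}}"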
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

2. Verify that other replicas are up and ready.

3. Remove the galera.cache file for the affected mariadb-server Pod (see the command sketch after this procedure).

4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
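The following is a minimal command sketch of the workaround steps; the Pod name, its namespace, and the backup location are placeholders to adapt to your deployment, and it assumes that the MariaDB data directory is /var/lib/mysql inside the Pod:

# Step 1: back up the data directory of the affected Pod.
kubectl -n <namespace> exec <mariadb-server-pod> -- tar czf /tmp/mysql-backup.tar.gz /var/lib/mysql
kubectl -n <namespace> cp <mariadb-server-pod>:/tmp/mysql-backup.tar.gz ./mysql-backup.tar.gz

# Step 3: remove the galera.cache file for the affected Pod.
kubectl -n <namespace> exec <mariadb-server-pod> -- rm /var/lib/mysql/galera.cache

# Step 4: remove the affected Pod so that Kubernetes recreates it.
kubectl -n <namespace> delete pod <mariadb-server-pod>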
StackLight¶
[43474] Custom Grafana dashboards are corrupted¶
Custom Grafana panels and dashboards may be corrupted after automatic migration of deprecated Angular-based plugins to the React-based ones. For details, see MOSK Deprecation Notes: Angular plugins in Grafana dashboards and the post-update step Back up custom Grafana dashboards in Container Cloud 2.28.4 update notes.
To work around the issue, manually adjust the affected dashboards to restore their custom appearance.
Container Cloud web UI¶
[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI because the web UI does not allow adding labels to the control plane machines or changing the dedicatedControlPlane: false value.
To work around the issue, manually add the required labels using the CLI. Once done, the cluster deployment resumes.
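The following is a minimal sketch of adding a label to a Machine object with kubectl, assuming access to the management cluster kubeconfig; the project namespace, machine name, and label key and value are placeholders, and the exact labels required for a compact cluster depend on your deployment:

kubectl -n <project-namespace> label machine <control-plane-machine-name> <label-key>=<label-value>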
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container Cloud web UI and contains different access denied errors during the first five minutes after creation.
To work around the issue, refresh the browser five minutes after the project creation.