Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.29.0, including the Cluster releases 17.4.0 and 16.4.0. For the list of MOSK known issues, see MOSK release notes 25.1: Known issues.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
Bare metal¶
[50287] BareMetalHost with a Redfish BMC address is stuck on registering phase¶
During addition of a bare metal host, a host that contains a Redfish Baseboard Management Controller (BMC) address with the following example configuration may get stuck in the registering phase:
bmc:
  address: redfish://192.168.1.150/redfish/v1/Systems/1
Workaround:
Open the ironic-config configmap for editing:
KUBECONFIG=mgmt_kubeconfig kubectl -n kaas edit cm ironic-config
In the data:ironic.conf section, add the enabled_firmware_interfaces parameter:
data:
  ironic.conf: |
    [DEFAULT]
    ...
    enabled_firmware_interfaces = redfish,no-firmware
    ...
Restart Ironic:
KUBECONFIG=mgmt_kubeconfig kubectl -n kaas rollout restart deployment/ironic
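Optionally, verify that the affected host eventually leaves the registering phase. The following check is a minimal sketch that assumes the standard BareMetalHost resource name; the namespace of the object depends on your setup:
KUBECONFIG=mgmt_kubeconfig kubectl get baremetalhosts -A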
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that was changed loses its external IP address, while the second service, changed later, obtains the external IP as expected.

To work around the issue, make a dummy change to the service object whose external IP is <pending>:
Identify the service that is stuck:
kubectl get svc -A | grep pending
Example of system response:
stacklight iam-proxy-prometheus LoadBalancer 10.233.28.196 <pending> 443:30430/TCP
Add an arbitrary label to the service that is stuck. For example:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
Example of system response:
service/iam-proxy-prometheus labeled
Verify that the external IP was allocated to the service:
kubectl get svc -n stacklight iam-proxy-prometheus
Example of system response:
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
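Optionally, once the external IP is allocated, remove the temporary label again. The command below is a minimal example for the service used above and relies on the standard kubectl syntax for label removal:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile-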
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
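For example, a minimal sketch of the cordon and drain sequence; the drain flags are common assumptions and may need adjustment depending on the workloads running on the node:
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data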
Ceph¶
[50637] Ceph creates second miracephnodedisable object during node disabling¶
During a managed cluster update, if a node is being disabled while ceph-maintenance-controller is restarted, a second miracephnodedisable object is erroneously created for the node. As a result, the second object fails in the Cleaning state, which blocks the managed cluster update.
Workaround
On the affected managed cluster, obtain the list of miracephnodedisable objects:
kubectl get miracephnodedisable -n ceph-lcm-mirantis
The system response must contain one completed and one failed miracephnodedisable object for the node being disabled. For example:

NAME                                               AGE   NODE NAME                                        STATE      LAST CHECK             ISSUE
nodedisable-353ccad2-8f19-4c11-95c9-a783abb531ba   58m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Ready      2025-03-06T22:04:48Z
nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef   57m   kaas-node-91207a35-3200-41d1-9ba9-388500970981   Cleaning   2025-03-07T11:59:27Z   host clean up Job 'ceph-lcm-mirantis/host-cleanup-nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef' is failed, check logs
Remove the failed miracephnodedisable object. For example:
kubectl delete miracephnodedisable -n ceph-lcm-mirantis nodedisable-58bbf563-1c76-4319-8c28-363d73a5efef
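Optionally, list the objects again to confirm that only the completed miracephnodedisable object remains for the node being disabled, after which the managed cluster update should proceed:
kubectl get miracephnodedisable -n ceph-lcm-mirantis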
[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue 66717, during a CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start slowly and even fail the startup probe with the following describe output in the rook-ceph-osd-X pod:
Warning Unhealthy 57s (x16 over 3m27s) kubelet Startup probe failed:
ceph daemon health check failed with the following output:
> no valid command found; 10 closest matches:
> 0
> 1
> 2
> abort
> assert
> bluefs debug_inject_read_zeros
> bluefs files list
> bluefs stats
> bluestore bluefs device info [<alloc_size:int>]
> config diff
> admin_socket: invalid command
Workaround:
Complete the following steps during every patch or major cluster update of the Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes supported):

Plan extra time in the maintenance window for the patch cluster update. Slow starts will still impact the update procedure, but after completing the following step, the recovery is noticeably shorter and does not affect the overall cluster state and data responsiveness.
Select one of the following options:

Option 1. Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSD image upgrade is done, unset the flag:
ceph osd unset noout

Option 2. Monitor the Ceph OSD image upgrade. If the symptoms of a slow start appear, set the noout flag as soon as possible. Once the Ceph OSD image upgrade is done, unset the flag.
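For reference, a minimal sketch of the first option; the ceph osd dump check is an optional assumption for confirming that the flag is set:
# Before the cluster update: prevent OSDs from being marked out during slow starts
ceph osd set noout
# Optionally confirm that the flag is set
ceph osd dump | grep flags
# After the Ceph OSD image upgrade completes
ceph osd unset noout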
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.
Workaround:
Verify that the description of the Pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>

In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas.

On every csi-rbdplugin Pod, search for the stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete the volumeattachment of the affected Pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
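As an optional final check, confirm that the stale attachment is gone and that the affected Pod is running again; the commands reuse the placeholders from the steps above:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl -n <affectedProjectName> get pod <affectedPodName>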
LCM¶
[50768] Failure to update the MCCUpgrade object¶
While editing the MCCUpgrade
object, the following error occurs when trying
to save changes:
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"Internal error occurred: failed calling webhook \"mccupgrades.kaas.mirantis.com\":
failed to call webhook: the server could not find the requested resource",
"reason":"InternalError",
"details":{"causes":[{"message":"failed calling webhook \"mccupgrades.kaas.mirantis.com\":
failed to call webhook: the server could not find the requested resource"}]},"code":500}
To work around the issue, remove the name: mccupgrades.kaas.mirantis.com entry from the mutatingwebhookconfiguration:
kubectl --kubeconfig kubeconfig edit mutatingwebhookconfiguration admission-controller
Example configuration:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: <REDACTED>
    service:
      name: admission-controller
      namespace: kaas
      path: /mccupgrades
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: mccupgrades.kaas.mirantis.com
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - kaas.mirantis.com
    apiVersions:
    - v1alpha1
    operations:
    - CREATE
    - UPDATE
    resources:
    - mccupgrades
    scope: '*'
  sideEffects: NoneOnDryRun
  timeoutSeconds: 5
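To verify the workaround, confirm that the entry is no longer present and then retry saving the MCCUpgrade object. The grep-based check below is only one convenient option:
kubectl --kubeconfig kubeconfig get mutatingwebhookconfiguration admission-controller -o yaml | grep mccupgrades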
[50561] The local-volume-provisioner pod switches to CrashLoopBackOff¶
After machine disablement and consequent re-enablement, persistent volumes (PVs) provisioned by local-volume-provisioner that are not used by any pod may cause the local-volume-provisioner pod on such a machine to switch to the CrashLoopBackOff state.
Workaround:
Identify the ID of the affected local-volume-provisioner:
kubectl -n kube-system get pods
Example of system response extract:
local-volume-provisioner-h5lrc 0/1 CrashLoopBackOff 33 (2m3s ago) 90m
In the local-volume-provisioner logs, identify the affected PVs. For example:
kubectl logs -n kube-system local-volume-provisioner-h5lrc | less
Example of system response extract:
E0304 23:21:31.455148 1 discovery.go:221] Failed to discover local volumes: 5 error(s) while discovering volumes:
[error creating PV "local-pv-1d04ed53" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol04": persistentvolumes "local-pv-1d04ed53" already exists
error creating PV "local-pv-ce2dfc24" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol01": persistentvolumes "local-pv-ce2dfc24" already exists
error creating PV "local-pv-bcb9e4bd" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol02": persistentvolumes "local-pv-bcb9e4bd" already exists
error creating PV "local-pv-c5924ada" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol03": persistentvolumes "local-pv-c5924ada" already exists
error creating PV "local-pv-7c7150cf" for volume at "/mnt/local-volumes/openstack-operator/bind-mounts/vol00": persistentvolumes "local-pv-7c7150cf" already exists]
Delete all PVs that contain the already exists error in logs. For example:
kubectl delete pv local-pv-1d04ed53
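Optionally, verify that the local-volume-provisioner pod recovers and that the deleted PVs are recreated:
kubectl -n kube-system get pods | grep local-volume-provisioner
kubectl get pv | grep local-pv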
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
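The following is a minimal sketch of the procedure above. The namespace, the Pod name mariadb-server-0, and the backup destination are placeholders and assumptions that must be adjusted to your environment:
# 1. Back up /var/lib/mysql from the affected replica to the local machine
kubectl cp <namespace>/mariadb-server-0:/var/lib/mysql ./mysql-backup
# 2. Verify that the other replicas are up and ready
kubectl -n <namespace> get pods | grep mariadb
# 3. Remove the galera.cache file on the affected replica
kubectl -n <namespace> exec mariadb-server-0 -- rm /var/lib/mysql/galera.cache
# 4. Remove the affected Pod or wait until it is restarted automatically
kubectl -n <namespace> delete pod mariadb-server-0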
StackLight¶
[43474] Custom Grafana dashboards are corrupted¶
Custom Grafana panels and dashboards may be corrupted after automatic migration of deprecated Angular-based plugins to the React-based ones. For details, see MOSK Deprecation Notes: Angular plugins in Grafana dashboards and the post-update step Back up custom Grafana dashboards in Container Cloud 2.28.4 update notes.
To work around the issue, manually adjust the affected dashboards to restore their custom appearance.
Container Cloud web UI¶
[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI because the web UI does not allow adding labels to the control plane machines or changing the dedicatedControlPlane: false setting.
To work around the issue, manually add the required labels using CLI. Once done, the cluster deployment resumes.
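For reference, a hedged sketch of adding the labels through CLI; the project namespace, machine name, and the required labels themselves are placeholders that depend on your deployment:
kubectl -n <projectNamespace> edit machine <controlPlaneMachineName>
# Add the required labels to the Machine object, then save and exit the editor.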
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container Cloud web UI and shows various access denied errors during the first five minutes after creation.

To work around the issue, wait five minutes after the project creation and refresh the browser.
[50140] The Ceph Clusters tab does not display Ceph cluster details¶
The Clusters page for the bare metal provider does not display
information about the Ceph cluster in the Ceph Clusters tab and
contains access denied
errors.
To work around the issue, verify the Ceph cluster state through CLI. For details, see MOSK documentation: Ceph operations - Verify Ceph.
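For a quick health check from CLI, assuming the standard rook-ceph-tools deployment is present on the managed cluster, one common option is the following; refer to the MOSK documentation above for the full verification procedure:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s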