Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.28.5, including the Cluster releases 16.3.5 and 17.3.5. For the known issues in the related MOSK release, see Known issues.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Troubleshooting Guide.
Note
This section also outlines known issues from previous releases that are still valid.
Bare metal¶
[47202] Inspection error on bare metal hosts after dnsmasq restart¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to MOSK Troubleshooting Guide: Inspection error on bare metal hosts after dnsmasq restart.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. This can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
- The dnsmasq pod was moved to another node.
- DHCP subnets were changed, including addition or removal. In this case, the dhcpd container of the dnsmasq pod is restarted.

  Caution

  If changing or adding DHCP subnets is required to bootstrap new nodes, wait until the dnsmasq pod becomes ready after the change, then create the BareMetalHost objects.
To verify whether the nodes are affected:
- Verify whether the BareMetalHost objects contain the inspection error:

  kubectl get bmh -n <managed-cluster-namespace-name>

  Example of system response:

  NAME            STATE         CONSUMER        ONLINE   ERROR              AGE
  test-master-1   provisioned   test-master-1   true                        9d
  test-master-2   provisioned   test-master-2   true                        9d
  test-master-3   provisioned   test-master-3   true                        9d
  test-worker-1   provisioned   test-worker-1   true                        9d
  test-worker-2   provisioned   test-worker-2   true                        9d
  test-worker-3   inspecting                    true     inspection error   19h

- Verify whether the dnsmasq pod was in the Ready state when the inspection of the affected bare metal hosts (test-worker-3 in the example above) was started:

  kubectl -n kaas get pod <dnsmasq-pod-name> -oyaml

  Example of system response:

  ...
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-10-10T15:37:34Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2024-10-11T07:38:54Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-10-11T07:38:54Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2024-10-10T15:37:34Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://6dbcf2fc4b36ce4c549c9191ab01f72d0236c51d42947675302675e4bfaf4cdf
      image: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq:base-2-28-alpine-20240812132650
      imageID: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq@sha256:3dad3e278add18e69b2608e462691c4823942641a0f0e25e6811e703e3c23b3b
      lastState:
        terminated:
          containerID: containerd://816fcf079cd544acd74e312065de5b5ed4dbf1dc6159fefffff4f644b5e45987
          exitCode: 0
          finishedAt: "2024-10-11T07:38:35Z"
          reason: Completed
          startedAt: "2024-10-10T15:37:45Z"
      name: dhcpd
      ready: true
      restartCount: 2
      started: true
      state:
        running:
          startedAt: "2024-10-11T07:38:37Z"
  ...

  In the system response above, the dhcpd container was not ready between "2024-10-11T07:38:35Z" and "2024-10-11T07:38:54Z".

- Verify the affected bare metal host. For example:

  kubectl get bmh -n managed-ns test-worker-3 -oyaml

  Example of system response:

  ...
  status:
    errorCount: 15
    errorMessage: Introspection timeout
    errorType: inspection error
    ...
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: "2024-10-11T07:38:19Z"
      provision:
        end: null
        start: null
      register:
        end: "2024-10-11T07:38:19Z"
        start: "2024-10-11T07:37:25Z"

  In the system response above, inspection was started at "2024-10-11T07:38:19Z", immediately before the period of the dhcpd container downtime. Therefore, this node is most likely affected by the issue.
Workaround
- Reboot the node using the IPMI reset or cycle command, for example, using ipmitool as shown in the sketch after this procedure.
- If the node fails to boot, remove the failed BareMetalHost object and create it again:

  - Remove the BareMetalHost object. For example:

    kubectl delete bmh -n managed-ns test-worker-3

  - Verify that the BareMetalHost object is removed:

    kubectl get bmh -n managed-ns test-worker-3

  - Create a BareMetalHost object from the template. For example:

    kubectl create -f bmhc-test-worker-3.yaml
    kubectl create -f bmh-test-worker-3.yaml
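The reboot in the first step could be performed with ipmitool, for example. This is only a sketch; the BMC address, credentials, and interface are placeholders to replace with your own values:

  # Power-cycle the host through its BMC
  ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power cycle
  # Alternatively, issue a hard reset
  ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power reset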
 
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed is left without an external IP
address assigned, while the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
- Identify the service that is stuck:

  kubectl get svc -A | grep pending

  Example of system response:

  stacklight   iam-proxy-prometheus   LoadBalancer   10.233.28.196   <pending>   443:30430/TCP

- Add an arbitrary label to the service that is stuck. For example:

  kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

  Example of system response:

  service/iam-proxy-prometheus labeled

- Verify that the external IP was allocated to the service:

  kubectl get svc -n stacklight iam-proxy-prometheus

  Example of system response:

  NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
  iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
- All Pods are stuck in the Terminating state
- A new ironic Pod fails to start
- The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
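For example, a minimal sketch assuming the affected node is named test-master-2 and that DaemonSet-managed Pods and emptyDir data can be safely ignored during the drain:

  kubectl cordon test-master-2
  kubectl drain test-master-2 --ignore-daemonsets --delete-emptydir-data --timeout=300s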
Ceph¶
[50566] Ceph upgrade is very slow during patch or major cluster update¶
Fixed in 2.29.3 (17.3.8, 16.3.8, and 16.4.3)
Due to the upstream Ceph issue
66717,
during a CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
 Warning  Unhealthy  57s (x16 over 3m27s)  kubelet  Startup probe failed:
 ceph daemon health check failed with the following output:
> no valid command found; 10 closest matches:
> 0
> 1
> 2
> abort
> assert
> bluefs debug_inject_read_zeros
> bluefs files list
> bluefs stats
> bluestore bluefs device info [<alloc_size:int>]
> config diff
> admin_socket: invalid command
Workaround:
Complete the following steps during every patch or major cluster update of the Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes supported):
- Plan extra time in the maintenance window for the patch cluster update.

  Slow starts will still impact the update procedure, but after completing the following step, the recovery process noticeably shortens without affecting the overall cluster state and data responsiveness.
- Select one of the following options:

  - Before the cluster update, set the noout flag:

      ceph osd set noout

    Once the Ceph OSDs image upgrade is done, unset the flag:

      ceph osd unset noout

  - Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear, set the noout flag as soon as possible. Once the Ceph OSDs image upgrade is done, unset the flag.
 
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Note
The issue does not reproduce since MOSK 25.2.
Update of a managed cluster based on bare metal with Ceph enabled fails with
the PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
- Verify that the description of the Pods that failed to run contains the FailedMount events:

  kubectl -n <affectedProjectName> describe pod <affectedPodName>

  In the command above, replace the following values:

  - <affectedProjectName> is the Container Cloud project name where the Pods failed to run
  - <affectedPodName> is a Pod name that failed to run in the specified project

  In the Pod description, identify the node name where the Pod failed to run.

- Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

  - Identify the csiPodName of the corresponding csi-rbdplugin:

    kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'

  - Output the affected csiPodName logs:

    kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
 
- Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas (see the sketch after this procedure).

- On every csi-rbdplugin Pod, search for the stuck csi-vol:

  for pod in `kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'`; do
    echo $pod
    kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
  done

- Unmap the affected csi-vol:

  rbd unmap -o force /dev/rbd<i>

  The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

- Delete the volumeattachment of the affected Pod:

  kubectl get volumeattachments | grep <csi-vol-uuid>
  kubectl delete volumeattachment <id>

- Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
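The scale-down and scale-up steps could look as follows. This is a sketch that assumes the affected workload is the prometheus-server StatefulSet in the stacklight namespace and that it originally ran 2 replicas; verify the actual name, namespace, and replica count on your cluster:

  # Scale the affected StatefulSet down to 0 replicas before the cleanup
  kubectl -n stacklight scale statefulset prometheus-server --replicas=0
  # ... perform the csi-vol search, unmap, and volumeattachment deletion steps ...
  # Scale it back up to the original number of replicas and watch until it is Running
  kubectl -n stacklight scale statefulset prometheus-server --replicas=2
  kubectl -n stacklight get statefulset prometheus-server -w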
LCM¶
[39437] Failure to replace a master node on a Container Cloud cluster¶
Fixed in 2.29.0 (17.4.0 and 16.4.0)
During the replacement of a master node on a cluster of any type, the process
may get stuck with the Kubelet's NodeReady condition is Unknown message in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
- Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
- Verify that other replicas are up and ready.
- Remove the galera.cache file for the affected mariadb-server Pod.
- Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
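The steps above could look as follows. This is only a sketch that assumes the affected replica is mariadb-server-2, the Pods run in the kaas namespace of the management cluster, and the container is named mariadb; verify the actual Pod, namespace, and container names on your cluster first:

  # Back up the /var/lib/mysql directory of the affected replica
  kubectl -n kaas exec mariadb-server-2 -c mariadb -- tar czf /tmp/mysql-backup.tar.gz /var/lib/mysql
  kubectl cp -c mariadb kaas/mariadb-server-2:/tmp/mysql-backup.tar.gz ./mysql-backup.tar.gz
  # Verify that the other replicas are up and ready
  kubectl -n kaas get pods | grep mariadb-server
  # Remove the galera.cache file of the affected replica
  kubectl -n kaas exec mariadb-server-2 -c mariadb -- rm /var/lib/mysql/galera.cache
  # Delete the affected Pod and let Kubernetes recreate it
  kubectl -n kaas delete pod mariadb-server-2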
StackLight¶
[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
Fixed in 2.29.0 (17.4.0 and 16.4.0)
On High Availability (HA) clusters that use Local Volume Provisioner (LVP), Prometheus and OpenSearch from StackLight may share the same pool of storage. In such configuration, OpenSearch may approach the 85% disk usage watermark due to the combined storage allocation and usage patterns set by the Persistent Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only to clusters on which OpenSearch and Prometheus utilize the same storage pool.
To verify that the cluster is affected:
- Verify the result of the following formula:

  0.8 × OpenSearch_PVC_Size_GB + Prometheus_PVC_Size_GB > 0.85 × Total_Storage_Capacity_GB

  In the formula, define the following values:

  OpenSearch_PVC_Size_GB

    Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB, defaulting to .values.elasticsearch.persistentVolumeClaimSize if unspecified. To obtain the OpenSearch PVC size:

    kubectl -n <namespaceName> get cluster <clusterName> -o yaml |\
      yq '.spec.providerSpec.value.helmReleases[] | select(.name == "stacklight") | .values.elasticsearch.persistentVolumeClaimSize'

    Example of system response:

    10000Gi

  Prometheus_PVC_Size_GB

    Sourced from .values.prometheusServer.persistentVolumeClaimSize. To obtain the Prometheus PVC size:

    kubectl -n <namespaceName> get cluster <clusterName> -o yaml |\
      yq '.spec.providerSpec.value.helmReleases[] | select(.name == "stacklight") | .values.prometheusServer.persistentVolumeClaimSize'

    Example of system response:

    4000Gi

  Total_Storage_Capacity_GB

    Total capacity of the OpenSearch PVCs. For LVP, the capacity of the storage pool. To obtain the total capacity:

    kubectl get pvc -n stacklight -l app=opensearch-master \
      -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage

    The system response contains multiple outputs, one per opensearch-master node. Select the capacity for the affected node.

  Note

  Convert the values to GB if they are set in different units.

  If the formula result is positive, it is an early indication that the cluster is affected. For a worked example of this check, see the sketch after this procedure.
- Verify whether the OpenSearchClusterStatusWarning or OpenSearchClusterStatusCritical alert is firing. If so, verify the following:

  - Log in to the OpenSearch web UI.

  - In Management -> Dev Tools, run the following command:

    GET _cluster/allocation/explain

    The following system response indicates that the corresponding node is affected:

    "explanation": "the node is above the low watermark cluster setting \
    [cluster.routing.allocation.disk.watermark.low=85%], using more disk space \
    than the maximum allowed [85.0%], actual free: [xx.xxx%]"

    Note

    The system response may contain an even higher watermark percentage than 85.0%, depending on the case.
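For illustration, a worked example of the affection check above with placeholder values only (a 2000 GB OpenSearch PVC, a 400 GB Prometheus PVC, and a 2200 GB storage pool), using awk for the arithmetic:

  awk 'BEGIN {
    opensearch = 2000; prometheus = 400; total = 2200;   # placeholder values in GB
    left = 0.8 * opensearch + prometheus;                # 2000 GB
    right = 0.85 * total;                                # 1870 GB
    if (left > right)
      printf "Affected: %.0f GB > %.0f GB\n", left, right;
    else
      printf "Not affected: %.0f GB <= %.0f GB\n", left, right;
  }'

Here 0.8 × 2000 + 400 = 2000 GB exceeds 0.85 × 2200 = 1870 GB, so such a cluster would be affected.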
 
Workaround:
Warning
The workaround implies adjustment of the retention threshold for OpenSearch. Depending on the new threshold, some old logs will be deleted.
- Adjust or set .values.elasticsearch.persistentVolumeUsableStorageSizeGB to a lower value for the affection check formula to become non-positive. For configuration details, see MOSK Operations Guide: StackLight configuration parameters - OpenSearch.

  Mirantis also recommends reserving some space for other PVCs using storage from the pool. Use the following formula to calculate the required space:

  persistentVolumeUsableStorageSizeGB = 0.84 × ((1 - Reserved_Percentage - Filesystem_Reserve) × Total_Storage_Capacity_GB - Prometheus_PVC_Size_GB) / 0.8

  In the formula, define the following values:

  Reserved_Percentage

    A user-defined variable that specifies what percentage of the total storage capacity should not be used by OpenSearch or Prometheus. It reserves space for other components and is expressed as a decimal. For example, for a 5% reservation, Reserved_Percentage is 0.05. Mirantis recommends using 0.05 as a starting point.

  Filesystem_Reserve

    Percentage to deduct for filesystems that may reserve some portion of the available storage, which is marked as occupied. For example, for EXT4, it is 5% by default, so the value must be 0.05.

  Prometheus_PVC_Size_GB

    Sourced from .values.prometheusServer.persistentVolumeClaimSize.

  Total_Storage_Capacity_GB

    Total capacity of the OpenSearch PVCs. For LVP, the capacity of the storage pool. To obtain the total capacity:

    kubectl get pvc -n stacklight -l app=opensearch-master \
      -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage

    The system response contains multiple outputs, one per opensearch-master node. Select the capacity for the affected node.

  Note

  Convert the values to GB if they are set in different units.

  The formula above provides the maximum safe storage to allocate for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use it as a reference for setting .values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster. A worked calculation with placeholder values is shown after this procedure.
- Wait up to 15-20 minutes for OpenSearch to perform the cleanup.
- Verify that the cluster is not affected anymore using the procedure above. 
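Using the same placeholder values as in the verification example (a 2200 GB pool, a 400 GB Prometheus PVC, a 0.05 reservation, and a 0.05 filesystem reserve), the sizing formula from the first step evaluates as follows:

  awk 'BEGIN {
    reserved = 0.05; fs_reserve = 0.05; total = 2200; prometheus = 400;   # placeholder values
    usable = 0.84 * ((1 - reserved - fs_reserve) * total - prometheus) / 0.8;
    printf "persistentVolumeUsableStorageSizeGB <= %.0f\n", usable;       # prints 1659
  }'

The resulting value, approximately 1659 GB, would be the maximum safe setting for persistentVolumeUsableStorageSizeGB in this hypothetical layout.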
Container Cloud web UI¶
[50181] Failure to deploy a compact cluster¶
Fixed in MOSK management 2.30.0 and MOSK 25.2
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines as well as
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once done, the cluster deployment resumes.
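A hypothetical sketch of adding a label to a control plane machine through the CLI, assuming the label your deployment requires is a stacklight node label and that the machine is named test-master-1 in the managed-ns project; the exact label and object names depend on your environment:

  kubectl -n managed-ns edit machine test-master-1
  # In the editor, add the required label under spec.providerSpec.value, for example:
  #   nodeLabels:
  #   - key: stacklight
  #     value: enabled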
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and shows various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the project creation.