Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.22.0 including the Cluster release 11.6.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Troubleshooting Guide.
Note
This section also outlines still valid known issues from previous releases.
Bare metal¶
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
- All Pods are stuck in the Terminating state
- A new ironic Pod fails to start
- The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
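A minimal example of this workaround, assuming <nodeName> is the manager node that runs the ironic Pod; the --ignore-daemonsets flag is typically required because manager nodes run DaemonSet-managed Pods:

kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets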
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
- cluster
- machine
- baremetalhost
- baremetalhostprofile
- l2template
- subnet
- ipamhost
- ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
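For example, a minimal sketch that loops over the object kinds listed above, assuming kubectl targets the management cluster and the objects reside in the current namespace, as in the command above:

for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete "$kind" -l kaas.mirantis.com/region=<regionName>
done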
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
Equinix Metal with private networking¶
[29296] Deployment of a managed cluster fails during provisioning¶
Deployment of a managed cluster based on Equinix Metal with private networking fails during provisioning with the following error:
InspectionError: Failed to obtain hardware details.
Ensure DHCP relay is up and running
Workaround:
- In deployment/dnsmasq, update the image tag version for the dhcpd container to base-alpine-20230118150429:

  kubectl -n kaas edit deployment/dnsmasq

- In dnsmasq.conf, override the default undionly.kpxe with the ipxe.pxe one:

  kubectl -n kaas edit cm dnsmasq-config

  Example of the existing configuration:

  dhcp-boot=/undionly.kpxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local

  Example of the new configuration:

  dhcp-boot=/ipxe.pxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local
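As an optional sanity check before retrying the managed cluster deployment, you can wait for the edited Deployment to finish rolling out:

kubectl -n kaas rollout status deployment/dnsmasq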
vSphere¶
[29647] The ‘Network prepared’ stage of cluster deployment never succeeds¶
During deployment of a vSphere-based management or regional cluster with IPAM
disabled, the Network prepared stage gets stuck in the NotStarted
status. The issue does not affect cluster deployment. Therefore, disregard
the error message.
LCM¶
[5782] Manager machine fails to be deployed during node replacement¶
Fixed in 2.28.4 (17.3.4 and 16.3.4)
During replacement of a manager machine, the following problems may occur:
- The system adds the node to Docker swarm but not to Kubernetes (see the check after this list)
- The node Deployment gets stuck with failed RethinkDB health checks
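To check the first symptom, you can compare the Docker swarm and Kubernetes node lists; a minimal sketch, assuming SSH access to a healthy manager node for the docker command and the cluster kubeconfig for kubectl:

docker node ls
kubectl get nodes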
Workaround:
- Delete the failed node. 
- Wait for the MKE cluster to become healthy. To monitor the cluster status:
  - Log in to the MKE web UI.
  - Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
- Deploy a new node.
[5568] The calico-kube-controllers Pod fails to clean up resources¶
Fixed in 2.28.4 (17.3.4 and 16.3.4)
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
- The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
- The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin
kubeconfig instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig again.
To obtain the admin kubeconfig:
kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml
If the related cluster is regional, replace <pathToMgmtKubeconfig> with
<pathToRegionalKubeconfig>.
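To confirm that the extracted admin kubeconfig works, you can run any read-only command against the affected cluster, for example:

kubectl --kubeconfig clusterKubeconfig.yaml get nodes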
TLS configuration¶
[29604] The ‘failed to get kubeconfig’ error during TLS configuration¶
When setting a new Transport Layer Security (TLS) certificate for a cluster,
the false positive failed to get kubeconfig error may occur on the
Waiting for TLS settings to be applied stage. No actions are required.
Therefore, disregard the error.
To verify the status of the TLS configuration being applied:
kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.<Application>}"
Possible values for the <Application> parameter are as follows:
- keycloak
- ui
- cache
- mke
- iamProxyAlerta
- iamProxyAlertManager
- iamProxyGrafana
- iamProxyKibana
- iamProxyPrometheus
Example of system response:
[
    {
        "expirationTime": "2024-01-06T09:37:04Z",
        "hostname": "domain.com"
    }
]
In this example, expirationTime equals the NotAfter field of the server
certificate, and hostname contains the hostname configured for the
application.
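Optionally, you can cross-check expirationTime against the certificate that the application actually serves; a minimal sketch, assuming the configured hostname (domain.com in the example above) is reachable on port 443:

echo | openssl s_client -connect domain.com:443 -servername domain.com 2>/dev/null | openssl x509 -noout -enddate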
StackLight¶
[30040] OpenSearch is not in the ‘deployed’ status during cluster update¶
Note
The issue may affect the Container Cloud or Cluster release update to the following versions:
- 2.22.0 for management and regional clusters 
- 11.6.0 for management, regional, and managed clusters 
- 13.2.5, 13.3.5, 13.4.3, and 13.5.2 for attached MKE clusters 
The issue does not affect clusters originally deployed since the following Cluster releases: 11.0.0, 8.6.0, 7.6.0.
During cluster update to versions mentioned in the note above, the following OpenSearch-related error may occur on clusters that were originally deployed or attached using Container Cloud 2.15.0 or earlier, before the transition from Elasticsearch to OpenSearch:
The stacklight/opensearch release of the stacklight/stacklight-bundle HelmBundle
reconciled by the stacklight/stacklight-helm-controller Controller
is not in the "deployed" status for the last 15 minutes.
The issue affects clusters with elasticsearch.persistentVolumeClaimSize
configured for values other than 30Gi.
To verify that the cluster is affected:
- Verify whether the HelmBundleReleaseNotDeployed alert for the opensearch release is firing. If so, the cluster is most probably affected. Otherwise, the cluster is not affected.
- Verify the reason of the HelmBundleReleaseNotDeployed alert for the opensearch release:

  kubectl get helmbundle stacklight-bundle -n stacklight -o json | jq '.status.releaseStatuses[] | select(.chart == "opensearch") | .message'

  Example system response from the affected cluster:

  Upgrade "opensearch" failed: cannot patch "opensearch-master" with kind StatefulSet: \
  StatefulSet.apps "opensearch-master" is invalid: spec: Forbidden: \
  updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
Workaround:
- Scale down the opensearch-dashboards and metricbeat resources to 0:

  kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards && \
  kubectl -n stacklight get pods -l app=opensearch-dashboards | awk '{if (NR!=1) {print $1}}' | xargs -r \
  kubectl -n stacklight wait --for=delete --timeout=10m pod

  kubectl -n stacklight scale --replicas 0 deployment metricbeat && \
  kubectl -n stacklight get pods -l app=metricbeat | awk '{if (NR!=1) {print $1}}' | xargs -r \
  kubectl -n stacklight wait --for=delete --timeout=10m pod

  Wait for the commands in this and the next step to complete. The completion time depends on the cluster size.

- Disable the elasticsearch-curator CronJob:

  kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": true}}'

- Scale down the opensearch-master StatefulSet:

  kubectl -n stacklight scale --replicas 0 statefulset opensearch-master && \
  kubectl -n stacklight get pods -l app=opensearch-master | awk '{if (NR!=1) {print $1}}' | xargs -r \
  kubectl -n stacklight wait --for=delete --timeout=30m pod

- Delete the OpenSearch Helm release:

  helm uninstall --no-hooks opensearch -n stacklight

- Wait up to 5 minutes for Helm Controller to retry the upgrade and properly create the opensearch-master StatefulSet.

  To verify readiness of the opensearch-master Pods:

  kubectl -n stacklight wait --for=condition=Ready --timeout=30m pod -l app=opensearch-master

  Example of a successful system response in an HA setup:

  pod/opensearch-master-0 condition met
  pod/opensearch-master-1 condition met
  pod/opensearch-master-2 condition met

  Example of a successful system response in a non-HA setup:

  pod/opensearch-master-0 condition met

- Scale up the opensearch-dashboards and metricbeat resources:

  kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards && \
  kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=opensearch-dashboards

  kubectl -n stacklight scale --replicas 1 deployment metricbeat && \
  kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=metricbeat

- Enable the elasticsearch-curator CronJob:

  kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": false}}'
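To confirm recovery, you can re-run the verification command from the steps above and make sure the HelmBundleReleaseNotDeployed alert stops firing; alternatively, check the Helm release status directly, assuming the same Helm v3 client used for the uninstall step:

helm status opensearch -n stacklight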
[29329] Recreation of the Patroni container replica is stuck¶
During an update of a Container Cloud cluster of any type, recreation of the
Patroni container replica is stuck in the degraded state due to the liveness
probe killing the container that runs the pg_rewind procedure. The issue
affects clusters on which the pg_rewind procedure takes more time than the
full cycle of the liveness probe.
The sample logs of the affected cluster:
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO:  stdout=
INFO:  stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0     LOG:  database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0  LOG:  could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0  LOG:  invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0  PANIC:  could not locate a valid checkpoint record
Workaround:
For the affected replica and PVC, run:
kubectl delete persistentvolumeclaim/storage-volume-patroni-<replica-id> -n stacklight
kubectl delete pod/patroni-<replica-id> -n stacklight
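To identify the affected replica before running the commands above, you can list the Patroni Pods and inspect the logs of the one that keeps restarting; a minimal sketch, assuming the patroni-<replica-id> naming pattern used in the commands above:

kubectl -n stacklight get pods | grep patroni
kubectl -n stacklight logs patroni-<replica-id> | grep -i 'crash recovery'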
[28822] Reference Application triggers alerts during its upgrade¶
On managed clusters with Reference Application enabled, the following alerts are triggered during a managed cluster update from the Cluster release 11.5.0 to 11.6.0 or 7.11.0 to 11.5.0:
- KubeDeploymentOutage for the refapp Deployment
- RefAppDown
- RefAppProbeTooLong
- RefAppTargetDown
This behavior is expected, no actions are required. Therefore, disregard these alerts.
[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM¶
On baremetal-based management clusters, the restart count of the
metric-collector Pod increases over time with reason: OOMKilled in
the containerStatuses of the metric-collector Pod. Only clusters with
HTTP proxy enabled are affected.
Such behavior is expected. Therefore, disregard these restarts.
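To review the restart count and the last termination reason of the Pod, a minimal check, assuming metric-collector runs in the stacklight namespace of the management cluster:

kubectl -n stacklight get pods | grep metric-collector
kubectl -n stacklight get pod <metricCollectorPodName> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'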
[28373] Alerta can get stuck after a failed initialization¶
During creation of a Container Cloud cluster of any type with StackLight
enabled, Alerta can get stuck after a failed initialization with only 1 Pod
in the READY state. For example:
kubectl get po -n stacklight -l app=alerta
NAME                          READY   STATUS    RESTARTS   AGE
pod/alerta-5f96b775db-45qsz   1/1     Running   0          20h
pod/alerta-5f96b775db-xj4rl   0/1     Running   0          20h
Workaround:
- Recreate the affected Alerta Pod:

  kubectl --kubeconfig <affectedClusterKubeconfig> -n stacklight delete pod <stuckAlertaPodName>

- Verify that both Alerta Pods are in the READY state:

  kubectl get po -n stacklight -l app=alerta
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created and get stuck during the migration.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
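To list candidate stuck pods before deleting them, you can filter for pods that are not running; a minimal sketch, assuming the stuck pods end up in the Failed phase:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --field-selector=status.phase=Failed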
Ceph¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Note
The issue does not reproduce since MOSK 25.2.
Update of a bare metal-based managed cluster with Ceph enabled fails with a
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
- Verify that the description of the Pods that failed to run contains the FailedMount events:

  kubectl -n <affectedProjectName> describe pod <affectedPodName>

  In the command above, replace the following values:

  - <affectedProjectName> is the Container Cloud project name where the Pods failed to run
  - <affectedPodName> is a Pod name that failed to run in the specified project

  In the Pod description, identify the node name where the Pod failed to run.

- Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

  - Identify csiPodName of the corresponding csi-rbdplugin:

    kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
    -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'

  - Output the affected csiPodName logs:

    kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin

- Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas.

- On every csi-rbdplugin Pod, search for the stuck csi-vol:

  for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
    echo $pod
    kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
  done

- Unmap the affected csi-vol:

  rbd unmap -o force /dev/rbd<i>

  The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

- Delete the volumeattachment of the affected Pod:

  kubectl get volumeattachments | grep <csi-vol-uuid>
  kubectl delete volumeattachment <id>

- Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
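After scaling back up, you can confirm that the previously pending PersistentVolumeClaim is bound and the Pod is running; a minimal check, assuming the prometheus-server StatefulSet from this issue and StackLight running in the stacklight namespace:

kubectl -n stacklight get pvc | grep prometheus-server
kubectl -n stacklight get pods | grep prometheus-server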