Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.15.0 including the Cluster releases 7.5.0 and 5.22.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Troubleshooting Guide.
Note
This section also outlines still valid known issues from previous releases.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready > terminating > pending) and with the
following error message appearing in logs:
'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
Disable Docker Content Trust (DCT):
Access the MKE web UI as admin.
Navigate to Admin > Admin Settings.
In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments, such as calico-kube-controllers, compose, compose-api, coredns, and so on (a sketch for restarting several of them at once follows this procedure):
kubectl -n kube-system delete deployment <deploymentName>
Once done, the cluster deployment or update resumes.
Re-enable DCT.
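If several of the listed deployments are affected, you can restart them in one pass. The following is a minimal sketch assuming the example deployment names above; include only the deployments that are flapping on your cluster:

# Restart the affected kube-system deployments one by one
for deployment in calico-kube-controllers compose compose-api coredns; do
  kubectl -n kube-system delete deployment "${deployment}"
done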
Equinix Metal¶
[20467] Failure to deploy an Equinix Metal based management cluster¶
Deployment of an Equinix Metal based management cluster with private networking
may fail during the Ironic deployment with the following error message. The
issue is caused by csi-rbdplugin provisioner pods that get stuck.
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
The workaround is to restart the csi-rbdplugin provisioner pods:
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner
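After the restart, you can optionally verify that the provisioner pods are recreated and reach the Running state:

# The recreated provisioner pods should report the Running status
kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner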
Bare metal¶
[20745] Namespace deletion failure after managed cluster removal¶
After removal of a managed cluster, the namespace is not deleted due to
KaaSCephOperationRequest CRs blocking the deletion. The workaround is to
manually remove finalizers and delete the KaaSCephOperationRequest CRs.
Workaround:
Remove finalizers from all KaaSCephOperationRequest resources:
kubectl -n <managed-ns> get kaascephoperationrequest -o name | xargs -I % kubectl -n <managed-ns> patch % -p '{"metadata":{"finalizers":null}}' --type=merge
Delete all KaaSCephOperationRequest resources:
kubectl -n <managed-ns> delete kaascephoperationrequest --all
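Once the resources are deleted, the namespace removal should complete. As an optional check, verify that the namespace is gone or no longer stuck in the Terminating state:

# The namespace should eventually disappear from the output
kubectl get namespace <managed-ns>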
[17792] Full preflight fails with a timeout waiting for BareMetalHost¶
If you run bootstrap.sh preflight with
KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition
Workaround:
Disable the full preflight by unsetting the KAAS_BM_FULL_PREFLIGHT environment variable.
Rerun bootstrap.sh preflight, which executes the fast preflight instead.
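A minimal shell sketch of the workaround, assuming the bootstrap script is invoked from its own directory:

# Drop the full preflight flag and rerun the fast preflight
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight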
IAM¶
LCM¶
[22341] The cordon-drain states are not removed after maintenance mode is unset¶
The cordon-drain states are not removed after the maintenance mode is unset
for a machine. This issue may occur due to the maintenance transition
being stuck on the NodeWorkloadLock object.
Workaround:
Select from the following options:
Disable the maintenance mode on the affected cluster as described in Enable cluster and machine maintenance mode.
Edit LCMClusterState in the spec section by setting value to "false" (a non-interactive alternative is sketched after this procedure):
kubectl edit lcmclusterstates -n <projectName> <LCMClusterStateName>

apiVersion: lcm.mirantis.com/v1alpha1
kind: LCMClusterState
metadata:
  ...
spec:
  ...
  value: "false"
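As a non-interactive alternative to kubectl edit, the same change can be applied with a merge patch. This is a sketch based on the object structure above; the project and resource names are placeholders:

# Set spec.value to "false" without opening an editor
kubectl -n <projectName> patch lcmclusterstates <LCMClusterStateName> \
  --type=merge -p '{"spec":{"value":"false"}}'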
Monitoring¶
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created during pod migration and get stuck.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
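To identify the stuck pod names for the command above, you can list the StackLight pods that did not reach the Running phase, for example:

# List StackLight pods that are not in the Running phase
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods \
  --field-selector=status.phase!=Running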
[21646] The kaas-exporter container is periodically throttled and OOMKilled¶
On highly loaded clusters, the kaas-exporter resource limits for CPU
and RAM are lower than the actual resource consumption. As a result, the
kaas-exporter container is periodically throttled and OOMKilled, which prevents
Container Cloud metrics gathering.
As a workaround, increase the default resource limits for kaas-exporter
in the Cluster object of the management cluster. For example:
spec:
...
providerSpec:
...
value:
...
kaas:
management:
helmReleases:
...
- name: kaas-exporter
values:
resources:
limits:
cpu: 100m
memory: 200Mi
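To apply the change above, edit the Cluster object of the management cluster and adjust spec.providerSpec.value.kaas.management.helmReleases accordingly. A sketch, assuming the management cluster object resides in the default project; substitute the cluster name with your own:

# Open the management cluster object for editing
kubectl -n default edit cluster <managementClusterName>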
Upgrade¶
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:
containerd --version

If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:

The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io

STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
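A quick way to check the deployment logs for this condition is to search for the downgrade message shown above; a sketch:

# Search the Ansible deployment logs for the downgrade error
grep -r "Packages were downgraded" /var/log/lcm/runners/<nodeName>/deploy/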
Workaround:
Warning
Apply the steps below to the affected nodes one by one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. This sequence ensures that each node is cordoned and drained
and that Docker is properly stopped, so no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
Wait for Ansible to reconcile. The node should become Ready in several minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.
Patch the remaining nodes one by one using the steps above. A verification sketch follows this procedure.
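To confirm that a patched node recovered, you can check the installed containerd version on the node and the node status from a host with kubectl access to the affected cluster; a sketch:

# On the patched node: the version should now report 1.5.8
containerd --version

# From a host with kubectl access to the cluster: the node should be Ready
kubectl get nodes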
[20189] Container Cloud web UI reports upgrade while running previous release¶
Under certain conditions, the upgrade of the baremetal-based management
cluster may get stuck even though the Container Cloud web UI reports a
successful upgrade. The issue is caused by inconsistent metadata in IPAM that
prevents automatic allocation of the Ceph network. It happens when IPAddr
objects associated with the management cluster nodes refer to a non-existent
Subnet object by the resource UID.
To verify whether the cluster is affected:
Inspect the baremetal-provider logs:
kubectl -n kaas logs deployments/baremetal-provider
If the logs contain the following entries, the cluster may be affected:
Ceph public network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0'
Ceph cluster network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0'
'default/kaas-mgmt' cluster nodes internal (LCM) IP addresses: 10.64.96.171,10.64.96.172,10.64.96.173
failed to configure ceph network for cluster default/kaas-mgmt:
Ceph network addresses auto-assignment error: validation failed for Ceph network addresses:
error parsing address '': invalid CIDR address:
Empty values of the network parameters in the last entry indicate that the provider cannot locate the Subnet object based on the data from the IPAddr object.

Note

In the logs, capture the internal (LCM) IP addresses of the cluster nodes to use them later in this procedure.

Validate the network address used for Ceph by inspecting the MiraCeph object:
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +clusterNet:"
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +publicNet:"

In the system response, verify that the clusterNet and publicNet values do not contain the 0.0.0.0/0 range.

Example of the system response on the affected cluster:
clusterNet: 0.0.0.0/0
publicNet: 0.0.0.0/0
Workaround:
Add a label to the Subnet object:

Note

To obtain the correct name of the label, use one of the cluster nodes internal (LCM) IP addresses from the baremetal-provider logs.

Obtain the subnet ID from the IPAddr object and store it in the SUBNETID environment variable. For example:
SUBNETID=$(kubectl get ipaddr -n default --selector=ipam/IP=10.64.96.171 -o custom-columns=":metadata.labels.ipam/SubnetID" | tr -d '\n')

Use the SUBNETID variable to restore the required label in the Subnet object:
kubectl -n default label subnet master-region-one ipam/UID-${SUBNETID}="1"

Verify that the cluster.sigs.k8s.io/cluster-name label exists for IPaddr objects:
kubectl -n default get ipaddr --show-labels | grep "cluster.sigs.k8s.io/cluster-name"

Skip the next step if all IPaddr objects corresponding to the management cluster nodes have this label.

Add the cluster.sigs.k8s.io/cluster-name label to IPaddr objects:
IPADDRNAMES=$(kubectl -n default get ipaddr -o custom-columns=":metadata.name")
for IP in $IPADDRNAMES; do
  kubectl -n default label ipaddr $IP cluster.sigs.k8s.io/cluster-name=<managementClusterName>
done

In the command above, substitute <managementClusterName> with the corresponding value.
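After completing these steps, you can re-run the earlier verification commands to confirm that the Ceph networks are no longer reported as 0.0.0.0/0 and that the provider stops logging the validation errors, for example:

# The clusterNet and publicNet values should no longer contain 0.0.0.0/0
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +clusterNet:"
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +publicNet:"

# The provider should stop reporting the Ceph network validation errors
kubectl -n kaas logs deployments/baremetal-provider | grep -i "ceph"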
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the
FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the descriptions of the pods that failed to run contain the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>

<affectedProjectName> is the Container Cloud project name where the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'

Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin

Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.

On every csi-rbdplugin pod, search for the stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done

Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>

The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete the volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachments <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
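As an optional final check, you can confirm that the stale volume attachment is gone and that the previously failing pods start; a sketch:

# No volumeattachment should reference the affected RBD volume anymore
kubectl get volumeattachments | grep <csi-vol-uuid>

# The previously failing pods should reach the Running state
kubectl -n <affectedProjectName> get pods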
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.