Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.17.0, including the Cluster releases 11.1.0 and 7.7.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines known issues from previous Container Cloud releases that are still valid.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
1. Disable Docker Content Trust (DCT):
   1. Access the MKE web UI as admin.
   2. Navigate to Admin > Admin Settings.
   3. In the left navigation pane, click Docker Content Trust and disable it.
2. Restart the affected deployments, such as calico-kube-controllers, compose, compose-api, coredns, and so on (see the sketch after this procedure):

   kubectl -n kube-system delete deployment <deploymentName>

   Once done, the cluster deployment or update resumes.
3. Re-enable DCT.
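To restart all affected deployments in one pass, you can wrap the command from step 2 in a short loop. A minimal sketch, assuming the deployment names listed above; adjust the list to match your cluster:

for d in calico-kube-controllers compose compose-api coredns; do
  # MKE recreates the deleted deployments automatically
  kubectl -n kube-system delete deployment "$d"
done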
Bare metal¶
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

- cluster
- machine
- baremetalhost
- baremetalhostprofile
- l2template
- subnet
- ipamhost
- ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
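To run the command above for every object type in the list, a minimal sketch; it assumes the objects reside in the namespace of your current context, so add -A where a kind is namespaced elsewhere:

for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  # Delete every object of this kind labeled with the affected region
  kubectl delete "$kind" -l kaas.mirantis.com/region=<regionName>
done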
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
Equinix Metal¶
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.
Workaround:
1. Verify that the description of the pods that failed to run contains the FailedMount events:

   kubectl -n <affectedProjectName> describe pod <affectedPodName>

   In the command above:
   - <affectedProjectName> is the Container Cloud project name where the pods failed to run
   - <affectedPodName> is the name of a pod that failed to run in this project
2. In the pod description, identify the node name where the pod failed to run.
3. Verify that the csi-rbdplugin logs of the affected node contain the "rbd volume mount failed: <csi-vol-uuid> is being used" error, where <csi-vol-uuid> is a unique RBD volume name:
   1. Identify the csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
        -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'

   2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
4. Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.
5. On every csi-rbdplugin pod, search for the stuck csi-vol:

   for pod in $(kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'); do
     echo $pod
     kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
   done
6. Unmap the affected csi-vol:

   rbd unmap -o force /dev/rbd<i>

   The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
7. Delete the volumeattachment of the affected pod (see the sketch after this procedure):

   kubectl get volumeattachments | grep <csi-vol-uuid>
   kubectl delete volumeattachment <id>
8. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
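The volumeattachment lookup and deletion in step 7 can also be combined. A minimal sketch that captures the attachment name programmatically, assuming exactly one attachment matches <csi-vol-uuid>:

# Look up the volumeattachment that references the stuck RBD volume
VA=$(kubectl get volumeattachments | grep <csi-vol-uuid> | awk '{print $1}')
kubectl delete volumeattachment "$VA"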
LCM¶
[23853] Replacement of a regional master node fails on bare metal and Equinix Metal¶
During replacement of a failed master node on regional clusters of the
bare metal and Equinix Metal providers, the KaaSCephOperationRequest
resource created to remove the failed node from the Ceph cluster is stuck with
the Failed
status and an error message in errorReason
. For example:
status:
removeStatus:
osdRemoveStatus:
errorReason: Timeout (30m0s) reached for waiting pg rebalance for osd 2
status: Failed
The Failed status blocks the replacement of the failed master node.
Workaround:
1. On the management cluster, obtain metadata.name, metadata.namespace, and the spec section of the stuck KaaSCephOperationRequest:

   kubectl get kaascephoperationrequest <kcorName> -o yaml

   Replace <kcorName> with the name of the KaaSCephOperationRequest that has the Failed status.
2. Create a new KaaSCephOperationRequest template and save it as .yaml. For example, kcor-stuck-regional.yaml:

   apiVersion: kaas.mirantis.com/v1alpha1
   kind: KaaSCephOperationRequest
   metadata:
     name: <newKcorName>
     namespace: <kcorNamespace>
   spec: <kcorSpec>
   <newKcorName>
     Name of the new KaaSCephOperationRequest that differs from the failed one. Usually, a failed KaaSCephOperationRequest resource is called delete-request-for-<masterMachineName>. Therefore, you can name the new resource delete-request-for-<masterMachineName>-new.
   <kcorNamespace>
     Namespace of the failed KaaSCephOperationRequest resource.
   <kcorSpec>
     Spec of the failed KaaSCephOperationRequest resource.
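For illustration, a filled-in version of the template from step 2. The resource name and namespace below are hypothetical and must match your failed request; the spec placeholder stays for you to fill in:

# Hypothetical example: the failed request is delete-request-for-master-0
# in the ceph-lcm-mirantis namespace
cat <<'EOF' > kcor-stuck-regional.yaml
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: delete-request-for-master-0-new
  namespace: ceph-lcm-mirantis
spec:
  # Paste the spec section of the failed KaaSCephOperationRequest here verbatim
EOF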
3. Apply the created template to the management cluster. For example:

   kubectl apply -f kcor-stuck-regional.yaml

4. Remove the failed KaaSCephOperationRequest resource from the management cluster:

   kubectl delete kaascephoperationrequest <kcorName>

   Replace <kcorName> with the name of the KaaSCephOperationRequest that has the Failed status.
StackLight¶
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created during the migration and get stuck.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
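To find the stuck pods before deleting them, you can filter by pod status. A minimal sketch, assuming the status column of the affected pods mentions NodeAffinity:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods | grep NodeAffinity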
Upgrade¶
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy
state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:

1. Log in to the Container Cloud web UI.
2. In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
3. Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:

   containerd --version

   If the version is 1.5.9, the cluster is affected.
4. In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:

   The following packages will be upgraded:
     docker-ee docker-ee-cli
   The following packages will be DOWNGRADED:
     containerd.io

   STDERR:
   E: Packages were downgraded and -y was used without --allow-downgrades.
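As a quick check for the last step above, you can search the deployment logs for the downgrade error directly. A minimal sketch:

# Returns matches if the node hit the containerd downgrade error
grep -r "Packages were downgraded" /var/log/lcm/runners/<nodeName>/deploy/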
Workaround:
Warning
Apply the steps below to the affected nodes one by one and only after each consecutive node gets stuck on the Deploy phase with the Ansible log errors. This sequence ensures that each node is cordoned and drained and that Docker is properly stopped, so no workloads are affected.
1. Using SSH, log in to the first affected node and install containerd 1.5.8:

   apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages

2. Wait for Ansible to reconcile. The node should become Ready in several minutes.
3. Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.
4. Patch the remaining nodes one by one using the steps above.
Container Cloud web UI¶
[24075] Ubuntu 20.04 does not display for AWS and Equinix Metal managed clusters¶
During creation of a machine for the AWS or Equinix Metal provider with public networking, the Ubuntu 20.04 option does not display in the drop-down list of operating systems in the Container Cloud web UI. Only Ubuntu 18.04 displays in the list.
Workaround:
1. Identify the parent management or regional cluster of the affected managed cluster located in the same region. For example, if the affected managed cluster was deployed in region-one, identify its parent cluster by running:

   kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n default get cluster -l kaas.mirantis.com/region=region-one

   Replace region-one with the corresponding value.

   Example of system response:

   NAME           AGE
   test-cluster   19d
2. Modify the related management or regional Cluster object with the correct values for the credentials-controller Helm releases:

   kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n default edit cluster <managementOrRegionalClusterName>

   In the system response, the editor displays the current state of the cluster. Find the spec.providerSpec.value.kaas.regional section.

   Example of the regional section in the Cluster object:

   spec:
     providerSpec:
       value:
         kaas:
           regional:
           - provider: aws
             helmReleases:
             - name: aws-credentials-controller
               values:
                 region: region-one
             ...
           - provider: equinixmetal
             ...
3. For the aws and equinixmetal providers (if available), modify the credentials-controller values as follows.

   Warning
   Do not overwrite existing values. For example, if one of the Helm releases already has region: region-one, do not modify or remove it.

   For aws-credentials-controller:

   values:
     config:
       allowedAMIs:
       - - name: name
           values:
           - "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211129"
         - name: owner-id
           values:
           - "099720109477"

   For equinixmetal-credentials-controller:

   values:
     config:
       allowedOperatingSystems:
       - distro: ubuntu
         version: 20.04
4. If the aws-credentials-controller or equinixmetal-credentials-controller Helm releases are missing in the spec.providerSpec.value.kaas.regional section, or the helmReleases array is missing for the corresponding provider, add the releases with the overwritten values.

   Example of the helmReleases array for AWS:

   - provider: aws
     helmReleases:
     - name: aws-credentials-controller
       values:
         config:
           allowedAMIs:
           - - name: name
               values:
               - "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20211129"
             - name: owner-id
               values:
               - "099720109477"
     ...

   Example of the helmReleases array for Equinix Metal:

   - provider: equinixmetal
     helmReleases:
     - name: equinixmetal-credentials-controller
       values:
         config:
           allowedOperatingSystems:
           - distro: ubuntu
             version: 20.04
5. Wait for approximately 2 minutes for the AWS and/or Equinix Metal credentials-controller to restart (see the check after this list).
6. Log out and log in again to the Container Cloud web UI.
7. Restart the machine addition procedure.
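To confirm the restart mentioned in step 5, you can watch the controller pods. A minimal sketch, assuming the credentials controllers run in the kaas namespace of the management cluster:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n kaas get pods | grep credentials-controller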
Warning
After Container Cloud is upgraded to 2.18.0, remove the values added during the workaround application from the Cluster object.
[23002] Inability to set a custom value for a predefined node label¶
During machine creation using the Container Cloud web UI, a custom value for a node label cannot be set.
As a workaround, manually add the value to spec.providerSpec.value.nodeLabels in machine.yaml.
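For illustration, a minimal sketch of the resulting Machine object fragment; the label key and value below are hypothetical:

spec:
  providerSpec:
    value:
      nodeLabels:
      - key: stacklight   # hypothetical predefined label key
        value: enabled    # the custom value to set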
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log out and log in again to the Container Cloud web UI.