Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.16.0, including the Cluster releases 11.0.0 and 7.6.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods
flapping (ready > terminating > pending) and with the following error message
appearing in logs:
'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
Disable Docker Content Trust (DCT):
Access the MKE web UI as admin.
Navigate to Admin > Admin Settings.
In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:
kubectl -n kube-system delete deployment <deploymentName>
Once done, the cluster deployment or update resumes.
Re-enable DCT.
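To confirm that the compose deployments recover after the restart, a check such as the following may help (a verification sketch, assuming kubectl access to the affected cluster):
kubectl -n kube-system get deployment compose compose-api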
Equinix Metal¶
[22264] KubeContainersCPUThrottlingHigh alerts for Equinix and AWS deployments¶
The default CPU limits for the Equinix and AWS controller containers, which are
set to 400m, may be lower than the actual resource consumption, leading to
KubeContainersCPUThrottlingHigh alerts in StackLight.
As a workaround, increase the default resource limits for the affected
equinix-controllers or aws-controllers deployment to 700m. For example:
kubectl edit deployment -n kaas aws-controllers
spec:
...
resources:
limits:
cpu: 700m
...
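Alternatively, the same change can be applied with a one-line patch (a sketch, assuming the controller is the first container in the deployment pod template):
kubectl -n kaas patch deployment aws-controllers --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "700m"}]'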
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the descriptions of the pods that failed to run contain the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
<affectedProjectName> is the Container Cloud project name where the pods failed to run.
<affectedPodName> is the name of a pod that failed to run in this project.
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify the csiPodName of the corresponding csi-rbdplugin pod:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the logs of the affected csiPodName:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas (see the sketch after this procedure).
On every csi-rbdplugin pod, search for the stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete the volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
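A minimal sketch of the scale-down and scale-up steps mentioned above, assuming the affected workload is a Deployment named <affectedDeployment> in the <affectedProjectName> namespace that originally ran 3 replicas (both names and the replica count are placeholders):
kubectl -n <affectedProjectName> scale deployment <affectedDeployment> --replicas=0
# perform the unmap and volumeattachment cleanup described above, then scale back up
kubectl -n <affectedProjectName> scale deployment <affectedDeployment> --replicas=3
kubectl -n <affectedProjectName> get pods -w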
Bare metal¶
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region
label of the affected
region:
cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
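For example, a sketch that iterates over all of the object kinds listed above, assuming the management cluster kubeconfig is in use:
for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete $kind -l kaas.mirantis.com/region=<regionName>
done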
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
[22563] Failure to deploy a bare metal node with RAID 1¶
Deployment of a bare metal node with an mdadm-based raid10
with LVM
enabled fails during provisioning due to insufficient cleanup of RAID devices.
Workaround:
Boot the affected node from any LiveCD, preferably Ubuntu.
Obtain details about the mdadm RAID devices:
sudo mdadm --detail --scan --verbose
Stop all mdadm RAID devices listed in the output of the above command. For example:
sudo mdadm --stop /dev/md0
Clean up the metadata on partitions with the mdadm RAID device(s) enabled. For example:
sudo mdadm --zero-superblock /dev/sda1
In the above example, replace /dev/sda1 with the partitions listed in the output of the command provided in step 2.
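A combined sketch of the cleanup, assuming a single RAID device /dev/md0 built from the hypothetical member partitions /dev/sda1 and /dev/sdb1:
sudo mdadm --detail --scan --verbose
sudo mdadm --stop /dev/md0
for part in /dev/sda1 /dev/sdb1; do
  sudo mdadm --zero-superblock $part
done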
[17792] Full preflight fails with a timeout waiting for BareMetalHost¶
If you run bootstrap.sh preflight with
KAAS_BM_FULL_PREFLIGHT=true
, the script fails with the following message:
preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition
Workaround:
Unset the KAAS_BM_FULL_PREFLIGHT environment variable.
Rerun bootstrap.sh preflight, which executes fast preflight instead.
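For example, a sketch assuming bootstrap.sh is run from the bootstrap directory:
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight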
IAM¶
StackLight¶
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed
error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created and get stuck during pod migration.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
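A sketch for listing the stuck pods before deleting them, assuming they are reported in the Failed phase:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --field-selector=status.phase=Failed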
[23006] StackLight endpoint crashes on start: private key does not match public key¶
In rare cases, StackLight applications may receive the wrong TLS certificates, which prevents them from starting correctly.
As a workaround, delete the old secret for the affected StackLight
component. For example, for iam-proxy-alerta
:
kubectl -n stacklight delete secret iam-proxy-alerta-tls-certs
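Assuming the secret is re-created automatically on the next reconciliation, the result can be verified as follows (a sketch):
kubectl -n stacklight get secret iam-proxy-alerta-tls-certs
kubectl -n stacklight get pods | grep iam-proxy-alerta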
LCM¶
[22341] The cordon-drain states are not removed after maintenance mode is unset¶
The cordon-drain states are not removed after the maintenance mode is unset
for a machine. This issue may occur due to the maintenance transition
being stuck on the NodeWorkloadLock
object.
Workaround:
Select from the following options:
Disable the maintenance mode on the affected cluster as described in Enable cluster and machine maintenance mode.
Edit LCMClusterState in the spec section by setting value to "false":
kubectl edit lcmclusterstates -n <projectName> <LCMClusterStateName>
apiVersion: lcm.mirantis.com/v1alpha1
kind: LCMClusterState
metadata:
  ...
spec:
  ...
  value: "false"
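Alternatively, the same change can be applied non-interactively with a merge patch (a sketch):
kubectl -n <projectName> patch lcmclusterstates <LCMClusterStateName> --type merge -p '{"spec":{"value":"false"}}'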
Upgrade¶
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy
state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
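For example, a sketch for searching the deployment logs for the indicative error:
grep -r "Packages were downgraded" /var/log/lcm/runners/<nodeName>/deploy/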
Workaround:
Warning
Apply the steps below to the affected nodes one by one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. This sequence ensures that each node is cordoned and drained
and that Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
Wait for Ansible to reconcile. The node should become Ready in several minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above to that node.
Patch the remaining nodes one by one using the steps above.
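A sketch for verifying the installed containerd version on a node after the downgrade (run over SSH on that node):
apt-cache policy containerd.io
containerd --version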
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token misses the necessary role for the new project. As a workaround, log out and log back in to the Container Cloud web UI.
Cluster health¶
[21494] Controller pods are OOMkilled after deployment¶
After a successful deployment of a management or regional cluster, controller
pods may be OOMkilled and get stuck in the CrashLoopBackOff state due to
incorrect memory limits.
Workaround:
Increase the memory resource limits on the affected Deployment:
Open the affected Deployment configuration for editing:
kubectl --kubeconfig <mgmtOrRegionalKubeconfig> -n kaas edit deployment <deploymentName>
Increase the value of spec.template.spec.containers.resources.limits.memory by 100-200 Mi. For example:
spec:
  template:
    spec:
      containers:
      - ...
        resources:
          limits:
            cpu: "3"
            memory: 500Mi
          requests:
            cpu: "1"
            memory: 300Mi
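A sketch for identifying the affected controller pods before editing the limits, assuming the pods report OOMKilled as the last termination reason:
kubectl --kubeconfig <mgmtOrRegionalKubeconfig> -n kaas get pods | grep CrashLoopBackOff
kubectl --kubeconfig <mgmtOrRegionalKubeconfig> -n kaas describe pod <podName> | grep -A 2 "Last State"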