Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.15.0 including the Cluster releases 7.5.0 and 5.22.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods
flapping (ready > terminating > pending) and with the following error message
appearing in logs:
'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
Disable Docker Content Trust (DCT):
Access the MKE web UI as admin.
Navigate to Admin > Admin Settings.
In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:
kubectl -n kube-system delete deployment <deploymentName>
Once done, the cluster deployment or update resumes.
Re-enable DCT.
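When several deployments are affected, the restart step can be scripted. The following is a minimal sketch and not part of the official procedure: the restart_deployments function and the KUBECTL override variable are hypothetical helpers introduced for illustration, and the list of deployments must match the ones actually affected in your cluster.

```shell
# Hypothetical helper: run the delete command from the workaround above
# for every deployment name passed as an argument.
# KUBECTL may be overridden, for example KUBECTL="echo kubectl", for a dry run.
restart_deployments() {
  for name in "$@"; do
    ${KUBECTL:-kubectl} -n kube-system delete deployment "$name"
  done
}
```

Example usage: restart_deployments calico-kube-controllers compose compose-api coredns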
Equinix Metal¶
[20467] Failure to deploy an Equinix Metal based management cluster¶
Deployment of an Equinix Metal based management cluster with private networking
may fail during the Ironic deployment because the csi-rbdplugin provisioner
pods get stuck. The following error message appears:
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
The workaround is to restart the csi-rbdplugin provisioner pods:
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner
Bare metal¶
[20745] Namespace deletion failure after managed cluster removal¶
After removal of a managed cluster, the namespace is not deleted due to
KaaSCephOperationRequest CRs blocking the deletion. The workaround is to
manually remove finalizers and delete the KaaSCephOperationRequest CRs.
Workaround:
Remove finalizers from all KaaSCephOperationRequest resources:
kubectl -n <managed-ns> get kaascephoperationrequest -o name | xargs -I % kubectl -n <managed-ns> patch % -p '{"metadata":{"finalizers":{}}}' --type=merge
Delete all KaaSCephOperationRequest resources:
kubectl -n <managed-ns> delete kaascephoperationrequest --all
[17792] Full preflight fails with a timeout waiting for BareMetalHost¶
If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition
Workaround:
Disable full preflight by unsetting the KAAS_BM_FULL_PREFLIGHT environment variable:
unset KAAS_BM_FULL_PREFLIGHT
Rerun bootstrap.sh preflight, which then executes the fast preflight instead.
IAM¶
LCM¶
[22341] The cordon-drain states are not removed after maintenance mode is unset¶
The cordon-drain states are not removed after the maintenance mode is unset
for a machine. This issue may occur due to the maintenance transition
being stuck on the NodeWorkloadLock object.
Workaround:
Select from the following options:
Disable the maintenance mode on the affected cluster as described in Enable cluster and machine maintenance mode.
Edit LCMClusterState by setting value to "false" in the spec section:
kubectl edit lcmclusterstates -n <projectName> <LCMClusterStateName>
apiVersion: lcm.mirantis.com/v1alpha1
kind: LCMClusterState
metadata:
  ...
spec:
  ...
  value: "false"
Monitoring¶
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for extra pods that are created and get stuck during pod migration.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
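To find candidates for removal, the pod list can be filtered by status. The following is a minimal sketch under a stated assumption: the stuck pods report NodeAffinity in the STATUS column of the kubectl get pods output. The list_nodeaffinity_pods function is a hypothetical helper, not part of Container Cloud.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose third column (STATUS) is NodeAffinity.
# Assumption: stuck pods are displayed with the NodeAffinity status.
list_nodeaffinity_pods() {
  awk '$3 == "NodeAffinity" { print $1 }'
}
```

Example usage: kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --no-headers | list_nodeaffinity_pods
Review the printed names before deleting any pod.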
[21646] The kaas-exporter container is periodically throttled and OOMKilled¶
On highly loaded clusters, the kaas-exporter resource limits for CPU
and RAM are lower than the amount of resources consumed. As a result, the
kaas-exporter container is periodically throttled and OOMKilled, which prevents
the Container Cloud metrics gathering.
As a workaround, increase the default resource limits for kaas-exporter
in the Cluster object of the management cluster. For example:
spec:
  ...
  providerSpec:
    ...
    value:
      ...
      kaas:
        management:
          helmReleases:
          ...
          - name: kaas-exporter
            values:
              resources:
                limits:
                  cpu: 100m
                  memory: 200Mi
Upgrade¶
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR: E: Packages were downgraded and -y was used without --allow-downgrades.
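The containerd version check from the verification steps can be expressed as a small shell function. This is a minimal sketch: is_affected_containerd is a hypothetical helper, and it assumes that the version number appears as a separate whitespace-delimited field in the containerd --version output.

```shell
# Hypothetical helper: succeed if the given `containerd --version` output
# contains version 1.5.9, with or without a leading "v".
# Assumption: the version is a separate whitespace-delimited field.
is_affected_containerd() {
  for field in $1; do
    case "$field" in
      1.5.9|v1.5.9) return 0 ;;
    esac
  done
  return 1
}
```

Example usage on the node: is_affected_containerd "$(containerd --version)" && echo "containerd 1.5.9 detected"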
Workaround:
Warning
Apply the steps below to the affected nodes one by one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. This sequence ensures that each node is cordon-drained
and Docker is properly stopped, so no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
Wait for Ansible to reconcile. The node should become Ready in several minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.
Patch the remaining nodes one by one using the steps above.
[20189] Container Cloud web UI reports upgrade while running previous release¶
Under certain conditions, the upgrade of the baremetal-based management
cluster may get stuck even though the Container Cloud web UI reports a
successful upgrade. The issue is caused by inconsistent metadata in IPAM that
prevents automatic allocation of the Ceph network. It happens when IPAddr
objects associated with the management cluster nodes refer to a non-existent
Subnet object by the resource UID.
To verify whether the cluster is affected:
Inspect the baremetal-provider logs:
kubectl -n kaas logs deployments/baremetal-provider
If the logs contain the following entries, the cluster may be affected:
Ceph public network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
Ceph cluster network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
'default/kaas-mgmt' cluster nodes internal (LCM) IP addresses: 10.64.96.171,10.64.96.172,10.64.96.173 \
failed to configure ceph network for cluster default/kaas-mgmt: \
Ceph network addresses auto-assignment error: validation failed for Ceph network addresses: \
error parsing address '': invalid CIDR address:
Empty values of the network parameters in the last entry indicate that the provider cannot locate the Subnet object based on the data from the IPAddr object.
Note
In the logs, capture the internal (LCM) IP addresses of the cluster nodes to use them later in this procedure.
Validate the network address used for Ceph by inspecting the MiraCeph object:
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +clusterNet:"
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +publicNet:"
In the system response, verify that the clusterNet and publicNet values do not contain the 0.0.0.0/0 range.
Example of the system response on the affected cluster:
clusterNet: 0.0.0.0/0
publicNet: 0.0.0.0/0
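The two checks on the MiraCeph object can be combined into one test. The following is a minimal sketch: ceph_nets_affected is a hypothetical helper, and it assumes that the clusterNet and publicNet values appear unquoted and indented in the YAML output, as in the example response above.

```shell
# Hypothetical helper: read MiraCeph YAML on stdin and succeed if clusterNet
# or publicNet is set to 0.0.0.0/0, which indicates an affected cluster.
# Assumption: the values are unquoted and the keys are indented with spaces.
ceph_nets_affected() {
  grep -Eq '^ +(clusterNet|publicNet): *0\.0\.0\.0/0 *$'
}
```

Example usage: kubectl -n ceph-lcm-mirantis get miraceph -o yaml | ceph_nets_affected && echo "cluster may be affected"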
Workaround:
Add a label to the Subnet object:
Note
To obtain the correct name of the label, use one of the cluster nodes' internal (LCM) IP addresses from the baremetal-provider logs.
Set the SUBNETID environment variable to the subnet ID from the label of the corresponding IPAddr object. For example:
SUBNETID=$(kubectl get ipaddr -n default --selector=ipam/IP=10.64.96.171 -o custom-columns=":metadata.labels.ipam/SubnetID" | tr -d '\n')
Use the SUBNETID variable to restore the required label in the Subnet object:
kubectl -n default label subnet master-region-one ipam/UID-${SUBNETID}="1"
Verify that the cluster.sigs.k8s.io/cluster-name label exists for IPAddr objects:
kubectl -n default get ipaddr --show-labels|grep "cluster.sigs.k8s.io/cluster-name"
Skip the next step if all IPAddr objects corresponding to the management cluster nodes have this label.
Add the cluster.sigs.k8s.io/cluster-name label to IPAddr objects:
IPADDRNAMES=$(kubectl -n default get ipaddr -o custom-columns=":metadata.name")
for IP in $IPADDRNAMES; do kubectl -n default label ipaddr $IP cluster.sigs.k8s.io/cluster-name=<managementClusterName>; done
In the command above, substitute <managementClusterName> with the corresponding value.
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the
FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the description of the pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
<affectedProjectName> is the Container Cloud project name where the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.
On every csi-rbdplugin pod, search for stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
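The procedure above does not spell out the scale commands. The following is a minimal sketch that uses the standard kubectl scale subcommand: scale_workload and the KUBECTL override variable are hypothetical helpers introduced for illustration, and the namespace, kind, and name arguments are values that you must supply for the affected workload.

```shell
# Hypothetical helper: scale a StatefulSet or Deployment to a replica count.
# Arguments: <namespace> <statefulset|deployment> <name> <replicas>
# KUBECTL may be overridden, for example KUBECTL="echo kubectl", for a dry run.
scale_workload() {
  ${KUBECTL:-kubectl} -n "$1" scale "$2" "$3" --replicas="$4"
}
```

Example usage: record the current count with kubectl -n <affectedProjectName> get statefulset <name> -o jsonpath='{.spec.replicas}', run scale_workload <affectedProjectName> statefulset <name> 0 before unmapping the volume, and run it again with the recorded count at the end of the procedure.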
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log in again to the Container Cloud web UI.