Known issues¶

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.16.0, including the Cluster releases 11.0.0 and 7.6.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.

MKE
Equinix Metal
Bare metal

IAM
StackLight
LCM

Upgrade
Container Cloud web UI
Cluster health

MKE¶

[20651] A cluster deployment or update fails with not ready compose deployments¶

A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
 got 0/0 replicas'
 ready: false
 type: Kubernetes

Workaround:

Disable Docker Content Trust (DCT):
1. Access the MKE web UI as admin.
2. Navigate to Admin > Admin Settings.
3. In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:
```
kubectl -n kube-system delete deployment <deploymentName>
```
Once done, the cluster deployment or update resumes.
Re-enable DCT.
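
The restart step above can be repeated for several deployments in one pass. The following is a minimal sketch, not a vendor-provided script: `restart_affected_deployments` is a hypothetical helper name, and the deployment names in the usage comment are only the examples mentioned in the workaround.

```shell
# Sketch: run the delete command from the workaround for each deployment name
# passed as an argument. Stops at the first failure.
restart_affected_deployments() {
  for deployment in "$@"; do
    echo "Restarting deployment: ${deployment}"
    kubectl -n kube-system delete deployment "${deployment}" || return 1
  done
}

# Example usage (run only after DCT is disabled):
# restart_affected_deployments calico-kube-controllers compose compose-api coredns
```

Passing the names explicitly keeps the operator in control of which deployments are touched, since the list in the workaround is not exhaustive.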

Equinix Metal¶

[22264] KubeContainersCPUThrottlingHigh alerts for Equinix and AWS deployments¶

Fixed in 2.17.0

The default CPU limit of 400m set for the Equinix and AWS controller containers may be lower than the amount of resources actually consumed, which leads to KubeContainersCPUThrottlingHigh alerts in StackLight.

As a workaround, increase the default resource limits for the affected equinix-controllers or aws-controllers to 700m. For example:

kubectl edit deployment -n kaas aws-controllers

spec:
...
  resources:
    limits:
      cpu: 700m
      ...
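
As a non-interactive alternative to editing the Deployment, the same limit can likely be applied with `kubectl set resources`. This is a sketch under assumptions: `set_controller_cpu_limit` is a hypothetical helper, and because no container is selected with `-c`, the limit is applied to every container of the deployment, which may be broader than the manual edit.

```shell
# Sketch: raise the CPU limit of a controller deployment in the kaas namespace.
# Without -c, `kubectl set resources` updates all containers in the deployment.
set_controller_cpu_limit() {
  deployment="$1"          # for example, aws-controllers or equinix-controllers
  cpu_limit="${2:-700m}"   # 700m is the value suggested in the workaround
  kubectl -n kaas set resources deployment "${deployment}" --limits="cpu=${cpu_limit}"
}

# Example usage:
# set_controller_cpu_limit aws-controllers
```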

[16379,23865] Cluster update fails with the FailedMount warning¶

Fixed in 2.19.0

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

Verify that the description of the pods that failed to run contains the FailedMount events:
```
kubectl -n <affectedProjectName> describe pod <affectedPodName>
```
- <affectedProjectName> is the Container Cloud project name where the pods failed to run
- <affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
1. Identify csiPodName of the corresponding csi-rbdplugin:
```
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
```
2. Output the affected csiPodName logs:
```
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
```
Scale down the affected StatefulSet or Deployment of the pod that fails to initialize to 0 replicas.

On every csi-rbdplugin pod, search for stuck csi-vol:

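# Iterate over all csi-rbdplugin pods, excluding the provisioner pods
for pod in $(kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'); do
  echo "$pod"
  # List mapped RBD devices in the pod and look for the stuck volume
  kubectl exec -it -n rook-ceph "$pod" -c csi-rbdplugin -- rbd device list | grep "<csi-vol-uuid>"
done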

Unmap the affected csi-vol:
```
rbd unmap -o force /dev/rbd<i>
```
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete volumeattachment of the affected pod:

kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
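
To pick the `/dev/rbd<i>` value for the unmap step out of the `rbd device list` output, a small filter may help. The sketch below is illustrative only: `find_rbd_device` is a hypothetical helper, and it assumes that the device path is the last whitespace-separated field of each row.

```shell
# Sketch: print the device path (last column) of every mapped RBD image whose
# row mentions the given csi-vol name. Reads `rbd device list` output on stdin.
# Assumption: the device path is the last field of a row.
find_rbd_device() {
  awk -v vol="$1" 'index($0, vol) > 0 { print $NF }'
}

# Example usage inside a csi-rbdplugin container:
# rbd device list | find_rbd_device <csi-vol-uuid>
```

Taking the last field rather than a fixed column number avoids miscounting when an intermediate column, such as the namespace, is empty.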

Bare metal¶

[20736] Region deletion failure after regional deployment failure¶

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr

kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
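
The delete command from this workaround can be applied to each listed object type in a loop. The following is a minimal sketch: `delete_region_objects` is a hypothetical helper, the order of object types simply follows the list above, and other objects carrying the region label may still need manual cleanup.

```shell
# Sketch: run the delete command from the workaround for each listed object type.
# The list is not exhaustive; other labeled objects may remain.
delete_region_objects() {
  region="$1"
  for kind in cluster machine baremetalhost baremetalhostprofile \
              l2template subnet ipamhost ipaddr; do
    kubectl delete "${kind}" -l "kaas.mirantis.com/region=${region}"
  done
}

# Example usage:
# delete_region_objects <regionName>
```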

[22563] Failure to deploy a bare metal node with RAID 1¶

Fixed in 2.17.0

Deployment of a bare metal node with an mdadm-based raid10 with LVM enabled fails during provisioning due to insufficient cleanup of RAID devices.

Workaround:

Boot the affected node from any LiveCD, preferably Ubuntu.
Obtain details about the mdadm RAID devices:
```
sudo mdadm --detail --scan --verbose
```
Stop all mdadm RAID devices listed in the output of the above command. For example:
```
sudo mdadm --stop /dev/md0
```
Clean up the metadata on partitions with the mdadm RAID device(s) enabled. For example:
```
sudo mdadm --zero-superblock /dev/sda1
```
In the above example, replace /dev/sda1 with each partition listed in the output of the mdadm --detail --scan --verbose command.
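
When a node has several arrays, the stop and cleanup commands can be derived from the scan output. The sketch below only prints the commands for review and does not execute them. `print_mdadm_cleanup` is a hypothetical helper, and it assumes that each array appears as an `ARRAY <device> ...` line followed by an indented `devices=<partition>,<partition>` line, as in typical verbose mdadm output.

```shell
# Sketch: read `mdadm --detail --scan --verbose` output on stdin and print the
# cleanup commands for review instead of running them.
# Assumption: "ARRAY <device> ..." lines are followed by "devices=a,b" lines.
print_mdadm_cleanup() {
  awk '
    $1 == "ARRAY" { print "sudo mdadm --stop " $2 }
    $1 ~ /^devices=/ {
      sub(/^devices=/, "", $1)
      n = split($1, parts, ",")
      for (i = 1; i <= n; i++) members[++count] = parts[i]
    }
    END {
      for (i = 1; i <= count; i++) print "sudo mdadm --zero-superblock " members[i]
    }
  '
}

# Example usage from the LiveCD:
# sudo mdadm --detail --scan --verbose | print_mdadm_cleanup
```

All stop commands are printed before the superblock cleanup commands, matching the order of the steps above.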

[17792] Full preflight fails with a timeout waiting for BareMetalHost¶

If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:

preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition

Workaround:

Disable full preflight by running unset KAAS_BM_FULL_PREFLIGHT.
Rerun bootstrap.sh preflight, which now executes fast preflight instead.

IAM¶

[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page¶

Fixed in 2.18.0

During configuration of a SAML identity provider using the Add identity provider menu of the Keycloak admin console, the page style breaks and the Save and Cancel buttons disappear.

Workaround:

Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select keycloak.
Click Save and refresh the browser window to apply the changes.

StackLight¶

[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.

On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.

The issue does not affect the StackLight services. All required StackLight pods migrate successfully, and only the extra pods that are created during pod migration get stuck.

As a workaround, remove the stuck pods:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
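
To find the stuck pods before deleting them, the pod list can be filtered by status. This is a sketch under assumptions: `list_node_affinity_pods` is a hypothetical helper, and it assumes that stuck pods show NodeAffinity in the STATUS column of the `kubectl get pods` output.

```shell
# Sketch: read `kubectl get pods --no-headers` output on stdin and print the
# names of pods whose STATUS column is NodeAffinity.
# Assumption: columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
list_node_affinity_pods() {
  awk '$3 == "NodeAffinity" { print $1 }'
}

# Example usage, with the same placeholder as in the command above:
# kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --no-headers \
#   | list_node_affinity_pods
```

The helper only prints names, so the list can be reviewed before any pod is deleted.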

[23006] StackLight endpoint crashes on start: private key does not match public key¶

Fixed in 2.17.0

In rare cases, StackLight applications may receive the wrong TLS certificates, which prevents them from starting correctly.

As a workaround, delete the old secret for the affected StackLight component. For example, for iam-proxy-alerta:

kubectl -n stacklight delete secret iam-proxy-alerta-tls-certs

LCM¶

[22341] The cordon-drain states are not removed after maintenance mode is unset¶

Fixed in 2.17.0

The cordon-drain states are not removed after the maintenance mode is unset for a machine. This issue may occur due to the maintenance transition being stuck on the NodeWorkloadLock object.

Workaround:

Select from the following options:

Disable the maintenance mode on the affected cluster as described in Enable cluster and machine maintenance mode.

Edit the LCMClusterState object by setting value to "false" in the spec section:

kubectl edit lcmclusterstates -n <projectName> <LCMClusterStateName>

apiVersion: lcm.mirantis.com/v1alpha1
kind: LCMClusterState
metadata:
  ...
spec:
  ...
  value: "false"
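
The same change can probably be made without an interactive editor by using a merge patch. This is a sketch, not a documented procedure: `set_lcmclusterstate_false` is a hypothetical helper, and it assumes that `value` sits directly under `spec`, as the fragment above suggests.

```shell
# Sketch: set spec.value to "false" on an LCMClusterState object.
# Assumption: value is a direct child of spec, as in the YAML fragment above.
set_lcmclusterstate_false() {
  project="$1"       # Container Cloud project (namespace) name
  state_name="$2"    # name of the LCMClusterState object
  kubectl -n "${project}" patch lcmclusterstates "${state_name}" \
    --type merge -p '{"spec":{"value":"false"}}'
}

# Example usage:
# set_lcmclusterstate_false <projectName> <LCMClusterStateName>
```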

Upgrade¶

[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶

Affects Ubuntu-based clusters deployed after Feb 10, 2022

If you deploy an Ubuntu-based cluster using the deprecated Cluster release 7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022, the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck while applying the Deploy state to the cluster machines. The issue affects all cluster types: management, regional, and managed.

To verify that the cluster is affected:

Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:
```
containerd --version
```
If the version is 1.5.9, the cluster is affected.

In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:

The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io

STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
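
The two node-level checks above can be scripted. The following is a minimal sketch: both function names are hypothetical, and the version check assumes that the version is the third field of the `containerd --version` output, optionally prefixed with `v`.

```shell
# Sketch: succeed if the `containerd --version` output on stdin reports 1.5.9.
# Assumption: the version is the third field, optionally prefixed with "v".
is_affected_containerd() {
  awk '{ v = $3; sub(/^v/, "", v); found = (v == "1.5.9") }
       END { exit(found ? 0 : 1) }'
}

# Succeed if the log file given as $1 contains the downgrade error quoted above.
has_downgrade_error() {
  grep -q 'Packages were downgraded and -y was used without --allow-downgrades' "$1"
}

# Example usage on the stuck node (the log file name is a placeholder):
# containerd --version | is_affected_containerd && echo "containerd 1.5.9 found"
# has_downgrade_error /var/log/lcm/runners/<nodeName>/deploy/<logFile> && echo "downgrade error found"
```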

Workaround:

Warning

Apply the steps below to the affected nodes one by one and only after each consecutive node gets stuck on the Deploy phase with the Ansible log errors. This sequence ensures that each node is cordon-drained and Docker is properly stopped, so no workloads are affected.

Using SSH, log in to the first affected node and install containerd 1.5.8:

apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages

Wait for Ansible to reconcile. The node should become Ready in several minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.
Patch the remaining nodes one by one using the steps above.

Container Cloud web UI¶

[249] A newly created project does not display in the Container Cloud web UI¶

Affects only Container Cloud 2.18.0 and earlier

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token lacks the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.

Cluster health¶

[21494] Controller pods are OOMkilled after deployment¶

Fixed in 2.17.0

After a successful deployment of a management or regional cluster, controller pods may be OOMkilled and get stuck in the CrashLoopBackOff state due to incorrect memory limits.

Workaround:

Increase the memory resource limits on the affected Deployment:

Open the affected Deployment configuration for editing:

kubectl --kubeconfig <mgmtOrRegionalKubeconfig> -n kaas edit deployment <deploymentName>

Increase the memory value in spec.template.spec.containers.resources.limits by 100-200 Mi. For example:

spec:
  template:
    spec:
      containers:
      - ...
        resources:
          limits:
            cpu: "3"
            memory: 500Mi
          requests:
            cpu: "1"
            memory: 300Mi
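
The new limit is simple arithmetic on the current value. The sketch below is a hypothetical convenience helper (`bump_memory_mi` is not part of any Container Cloud tooling) and handles only quantities expressed in Mi.

```shell
# Sketch: add a number of Mi to a memory quantity expressed in Mi.
# Only the Mi suffix is handled; other units are rejected.
bump_memory_mi() {
  current="$1"   # current limit, for example 300Mi
  extra="$2"     # amount to add in Mi, for example 200
  case "${current}" in
    *Mi) echo "$(( ${current%Mi} + extra ))Mi" ;;
    *)   echo "unsupported unit: ${current}" >&2; return 1 ;;
  esac
}

# Example usage:
# bump_memory_mi 300Mi 200
```

The result can then be entered as the new memory limit when editing the Deployment.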