This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.25.0, including the Cluster releases
17.0.0, 16.0.0, and 14.1.0.
An arbitrary Kubernetes pod may get stuck in an error loop due to a failed
Calico networking setup for that pod. The pod cannot access any network
resources. The issue occurs more often during cluster upgrade or node
replacement, but it can also happen during a new deployment. To confirm the
issue, search the calico-node logs for entries that reference the IP address
of the failed pod (for example, 10.233.121.132).
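For example, assuming calico-node runs as a DaemonSet in the kube-system
namespace (the namespace and label selector are assumptions, adjust them to
your environment):

    # List calico-node Pods and search the one on the affected node
    # for the failed pod IP.
    kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
    kubectl -n kube-system logs <calicoNodePodName> | grep 10.233.121.132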
Due to the upstream Calico issue, a controller node
cannot be deleted if the calico-node Pod is stuck blocking node deletion.
One of the symptoms is the following warning in the baremetal-operator
logs:
Resolving dependency Service dhcp-lb in namespace kaas failed: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints dhcp-lb).
As a workaround, delete the stuck calico-node Pod to retrigger the node
deletion.
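For example (the Pod name, node name, and namespace are placeholders):

    # Find the Pod scheduled on the affected node and delete it
    # to let the node deletion proceed.
    kubectl -n kube-system get pods -o wide | grep <nodeName>
    kubectl -n kube-system delete pod <calicoNodePodName>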
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
When using OpenStackCredential with a custom CACert, a management or
managed cluster deployment or upgrade is blocked by all pods being stuck in
the Pending state. The issue is caused by incorrect secrets being used to
initialize the OpenStack external Cloud Provider Interface.
As a workaround, copy CACert from the OpenStackCredential object
to openstack-ca-secret:
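The following is only an illustrative sketch of such a copy; the namespaces,
the .spec.CACert field path, and the secret key are assumptions, so verify
them against the actual objects in your cluster first:

    # Read the CA certificate from the OpenStackCredential object
    # (the field path is an assumption).
    CACERT=$(kubectl -n <projectNamespace> get openstackcredential <credentialName> \
      -o jsonpath='{.spec.CACert}')
    # Patch it into openstack-ca-secret; Secret data must be base64-encoded,
    # and this assumes the extracted value is plain PEM text.
    kubectl -n <clusterNamespace> patch secret openstack-ca-secret --type merge \
      -p "{\"data\": {\"CACert\": \"$(echo -n "${CACERT}" | base64 -w0)\"}}"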
A sign-in to the MKE web UI of the management cluster using the
Sign in with External Provider option can fail with the
invalid parameter: redirect_uri error.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the IAM realm.
Navigate to Clients > kaas.
On the page, navigate to Settings > Access settings > Valid redirect URIs.
Add https://<mgmtMkeIp>:6443/* to the list of valid redirect URIs
and click Save.
[31186,34132] Pods get stuck during MariaDB operations¶
Due to the upstream MariaDB issue, Pods may get stuck in continuous restarts
during MariaDB operations on a management cluster.
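To check whether a cluster is affected, you can inspect the MariaDB Pods for
continuous restarts; the namespace and Pod name below are placeholders:

    # Locate the MariaDB Pods and review their restart counts and recent events.
    kubectl get pods -A | grep -i mariadb
    kubectl -n <mariadbNamespace> describe pod <mariadbPodName> | tail -n 30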
On MOSK clusters, the Ansible provisioner may hang in a loop while trying to
remove LVM thin pool logical volumes (LVs) due to issues with volume detection
before removal. The Ansible provisioner cannot remove LVM thin pool LVs
correctly, so it consistently detects the same volumes whenever it scans
disks, leading to a repetitive cleanup process.
The following symptoms mean that a cluster can be affected:
A node was configured to use thin pool LVs. For example, it had the
OpenStack Cinder role in the past.
A bare metal node deployment flaps between the provisioning and
deprovisioning states.
In the Ansible provisioner logs, the following example warnings are growing:
88621.log:7389:2023-06-22 16:30:45.109 88621 ERROR ansible.plugins.callback.ironic_log [-] Ansible task clean:fail failed on node 14eb0dbc-c73a-4298-8912-4bb12340ff49: {'msg': 'There are more devices to clean', '_ansible_no_log': None, 'changed': False}
Important
There are more devices to clean is a regular warning
indicating some in-progress tasks. But if the number of such warnings keeps
growing while the node flaps between the provisioning and deprovisioning
states, the cluster is highly likely affected by the issue.
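To gauge whether the number of warnings keeps growing, you can count the
occurrences in the provisioner log over time; the log file path is a
placeholder:

    # Rerun periodically and compare the resulting counts.
    grep -c 'There are more devices to clean' <ansibleProvisionerLogFile>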
As a workaround, erase disks manually using any preferred tool.
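For example, a possible manual cleanup from a shell on the affected node,
assuming you have identified the thin pool volume group and the disk to wipe
(all names are placeholders, and the commands are destructive):

    # Inspect the current LVM layout first.
    lvs
    vgs
    # Remove the thin pool logical volume, its volume group, and the physical volume.
    lvremove -y <vgName>/<thinPoolLvName>
    vgremove -y <vgName>
    pvremove -y /dev/<diskName>
    # Erase any remaining filesystem and LVM signatures from the disk.
    wipefs -a /dev/<diskName>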
[30294] Replacement of a master node is stuck on the calico-node Pod start¶
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
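The exact invocation depends on your MKE version; the following is only a
rough sketch of such an alias, with the image tag and volume mounts as
assumptions to verify against the MKE documentation:

    # Run calicoctl from the mirantis/ucp-dsinfo image on the host network
    # so that it can reach the Calico datastore.
    alias calicoctl="docker run -i --rm --net host --pid host \
      -v /var/run/calico:/var/run/calico \
      -v /var/run/docker.sock:/var/run/docker.sock \
      mirantis/ucp-dsinfo:<mkeVersion> calicoctl"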
[5568] The calico-kube-controllers Pod fails to clean up resources¶
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
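For example (the node name is a placeholder; the drain flags may need
adjusting for your workloads and kubectl version):

    # Prevent new Pods from being scheduled on the node, then evict the existing ones.
    kubectl cordon <nodeName>
    kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data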
Due to the upstream Ceph issue,
on clusters with the Federal Information Processing Standard (FIPS) mode
enabled, the Ceph rook-operator fails to connect to Ceph RADOS Gateway
(RGW) pods.
As a workaround, do not place Ceph RGW pods on nodes where FIPS mode is
enabled.
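To verify whether FIPS mode is enabled on a particular node, you can check
the kernel flag directly, for example:

    # Prints 1 when FIPS mode is active, 0 otherwise.
    cat /proc/sys/crypto/fips_enabled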
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
the PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
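A probable form of the command, assuming standard kubectl usage:

    # Show the Pod description, including recent events such as FailedMount.
    kubectl -n <affectedProjectName> describe pod <affectedPodName>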
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
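For example, a sketch that selects the csi-rbdplugin Pod running on the
affected node; the namespace and label selector are assumptions based on a
typical Rook deployment:

    # Pick the csi-rbdplugin Pod scheduled on the node identified earlier.
    csiPodName=$(kubectl -n rook-ceph get pods -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}')
    echo "${csiPodName}"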
Container Cloud upgrade may be blocked by a node being stuck in the Prepare
or Deploy state with error processing package openssh-server.
The issue is caused by customizations in /etc/ssh/sshd_config, such as
additional Match statements. This file is managed by Container Cloud and
must not be altered manually.
As a workaround, move customizations from sshd_config to a new file
in the /etc/ssh/sshd_config.d/ directory.
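For example, a hypothetical customization moved into a drop-in file; the file
name and Match block are placeholders, and the service name may be sshd
instead of ssh on some distributions. Verify that the managed sshd_config
contains an Include /etc/ssh/sshd_config.d/*.conf directive before relying on
drop-in files:

    # Write the custom Match block into a drop-in file instead of editing
    # the managed /etc/ssh/sshd_config.
    printf 'Match Address 10.0.0.0/8\n    PasswordAuthentication no\n' | \
      sudo tee /etc/ssh/sshd_config.d/99-custom.conf
    # Validate the resulting configuration and reload the SSH service.
    sudo sshd -t
    sudo systemctl reload ssh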
[36928] The helm-controller Deployment is stuck during cluster update¶
During a cluster update, a Kubernetes helm-controller Deployment may
get stuck in a restarting Pod loop with Terminating and Running states
flapping. Other Deployment types may also be affected.
As a workaround, restart the Deployment that got stuck:
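A sketch of such a restart, scaling the Deployment down to zero and back to
its original size:

    # Note the current number of replicas, scale the Deployment to zero,
    # then scale it back to the original number of replicas.
    kubectl -n <affectedProjectName> get deploy <affectedDeployName>
    kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas 0
    kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas <replicasNumber>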
In the commands above, replace the following values:
<affectedProjectName> is the Container Cloud project name containing
the cluster with stuck Pods
<affectedDeployName> is the Deployment name that failed to run Pods
in the specified project
<replicasNumber> is the original number of replicas for the
Deployment that you can obtain using the get deploy command
[33438] ‘CalicoDataplaneFailuresHigh’ alert is firing during cluster update¶
During cluster update of a managed bare metal cluster, the false positive
CalicoDataplaneFailuresHigh alert may be firing. Disregard this alert,
which will disappear once cluster update succeeds.
The observed behavior is typical for calico-node during upgrades,
as workload changes occur frequently. Consequently, there is a possibility
of temporary desynchronization in the Calico dataplane. This can occasionally
result in throttling when applying workload changes to the Calico dataplane.