Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud releases 2.24.0 and 2.24.1, including the Cluster release 14.0.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
Bare metal¶
[35429] The Wireguard interface does not have the IPv4 address assigned¶
Due to the upstream Calico issue, on clusters with Wireguard enabled, the Wireguard interface on a node may not have the IPv4 address assigned. This leads to broken inter-Pod communication between the affected node and other cluster nodes.
The node is affected if the IP address is missing on the Wireguard interface:
ip a show wireguard.cali
Example of system response:
40: wireguard.cali: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000 link/none
The workaround is to manually restart the calico-node Pod to allocate the IPv4 address on the Wireguard interface:
docker restart $(docker ps -f "label=name=Calico node" -q)
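To confirm that the workaround helped, re-check the interface on the affected node; an inet line with an IPv4 address should now be present:
ip a show wireguard.cali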
[34280] No reconcile events generated during cluster update¶
The cluster update gets stuck waiting for agents to upgrade, with the following message in the cluster status:
Helm charts are not installed(upgraded) yet. Not ready releases: managed-lcm-api
The workaround is to retrigger the cluster update, for example, by adding an annotation to the cluster object:
Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.
Open the management Cluster object for editing:
kubectl edit cluster <mgmtClusterName>
Set the annotation force-reconcile: true.
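As a minimal sketch, the same annotation can also be set in one command, assuming force-reconcile is the full annotation key as named above:
kubectl annotate cluster <mgmtClusterName> force-reconcile=true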
[34210] Helm charts installation failure during cluster update¶
The cluster update is blocked with the following message in the cluster status:
Helm charts are not installed(upgraded) yet.
Not ready releases: iam, managed-lcm-api, admission-controller, baremetal-operator.
Workaround:
Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.
Open the baremetal-operator deployment object for editing:
kubectl edit deploy -n kaas baremetal-operator
Modify the image that the init container and the container are using to mirantis.azurecr.io/bm/baremetal-operator:base-alpine-20230721153358.
The baremetal-operator pods will be re-created, and the cluster update will get unblocked.
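To verify the result of the edit, you can print the images used by the init container and the main container, for example:
kubectl -n kaas get deploy baremetal-operator \
  -o jsonpath='{.spec.template.spec.initContainers[*].image}{"\n"}{.spec.template.spec.containers[*].image}{"\n"}'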
[33936] Deletion failure of a controller node during machine replacement¶
Due to the upstream Calico issue, a controller node cannot be deleted if the calico-node Pod is stuck blocking node deletion. One of the symptoms is the following warning in the baremetal-operator logs:
Resolving dependency Service dhcp-lb in namespace kaas failed: \
the server was unable to return a response in the time allotted,\
but may still be processing the request (get endpoints dhcp-lb).
As a workaround, delete the stuck Pod to retrigger the node deletion.
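A minimal sketch of the workaround, assuming the stuck Pod is the calico-node Pod in the kube-system namespace (adjust the namespace and Pod name as needed):
kubectl -n kube-system delete pod <stuckCalicoNodePod>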
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
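A minimal sketch of this sequence; the drain flags shown are common defaults and may need adjusting for your workloads:
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data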
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:
cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
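As a sketch, the deletion can be looped over the object kinds listed above; extend the list if other objects with the region label remain:
for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete "$kind" -l kaas.mirantis.com/region=<regionName>
done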
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
LCM¶
[34132] Pods get stuck during MariaDB operations¶
Due to the upstream MariaDB issue, during MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
Log in to the node where the affected Pod is running.
In /mnt/local-volumes/src/iam/kaas-iam-data/vol00/, remove the galera.cache file for the affected Pod.
Remove the affected Pod or wait until it is automatically restarted.
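A minimal sketch of these steps, assuming the path above and placeholder Pod and namespace names:
# On the affected node, remove the stale Galera cache file
rm /mnt/local-volumes/src/iam/kaas-iam-data/vol00/galera.cache
# Delete the affected Pod so that it is re-created (names are placeholders)
kubectl -n <namespace> delete pod <affectedMariadbPod>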
[32761] Node cleanup fails due to remaining devices¶
On MOSK clusters, the Ansible provisioner may hang in a loop while trying to remove LVM thin pool logical volumes (LVs) due to issues with volume detection before removal. The Ansible provisioner cannot remove LVM thin pool LVs correctly, so it consistently detects the same volumes whenever it scans disks, leading to a repetitive cleanup process.
The following symptoms mean that a cluster can be affected:
A node was configured to use thin pool LVs. For example, it had the OpenStack Cinder role in the past.
A bare metal node deployment flaps between the provisioning and deprovisioning states.
In the Ansible provisioner logs, the following example warnings are growing:
88621.log:7389:2023-06-22 16:30:45.109 88621 ERROR ansible.plugins.callback.ironic_log [-] Ansible task clean : fail failed on node 14eb0dbc-c73a-4298-8912-4bb12340ff49: {'msg': 'There are more devices to clean', '_ansible_no_log': None, 'changed': False}
Important
There are more devices to clean is a regular warning indicating some in-progress tasks. But if the number of such warnings is growing along with the node flapping between the provisioning and deprovisioning states, the cluster is highly likely affected by the issue.
As a workaround, erase disks manually using any preferred tool.
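For example, one possible manual cleanup with standard LVM tools; the volume group, thin pool, and disk names below are placeholders, so double-check them before wiping anything:
lvremove -y /dev/<vgName>/<thinPoolName>
vgremove -y <vgName>
pvremove -y /dev/<diskDevice>
wipefs -a /dev/<diskDevice>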
[34247] MKE backup fails during cluster update¶
MKE backup may fail during update of a management, regional, or managed cluster due to wrong permissions in the etcd backup /var/lib/docker/volumes/ucp-backup/_data directory.
The issue affects only clusters that were originally deployed using early Container Cloud releases delivered in 2020-2021.
Workaround:
Fix permissions on all affected nodes:
chown -R nobody:nogroup /var/lib/docker/volumes/ucp-backup/_data
Using the admin kubeconfig, increase the mkeUpgradeAttempts value:
Open the LCMCluster object of the management cluster for editing:
kubectl edit lcmcluster <mgmtClusterName>
In the mkeUpgradeAttempts field, increase the value to 6. Once done, MKE backup retriggers automatically.
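To confirm the permission fix on a node, you can list the directory ownership, which should show nobody:nogroup:
ls -ld /var/lib/docker/volumes/ucp-backup/_data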
[30294] Replacement of a ‘master’ node is stuck on the ‘calico-node’ Pod start¶
During replacement of a master node on a cluster of any type, the calico-node Pod fails to start on a new node that has the same IP address as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start calicoctl using the mirantis/ucp-dsinfo image:
alias calicoctl="\
docker run -i --rm \
  --pid host \
  --net host \
  -e constraint:ostype==linux \
  -e ETCD_ENDPOINTS=<etcdEndpoint> \
  -e ETCD_KEY_FILE=/ucp-node-certs/key.pem \
  -e ETCD_CA_CERT_FILE=/ucp-node-certs/ca.pem \
  -e ETCD_CERT_FILE=/ucp-node-certs/cert.pem \
  -v /var/run/calico:/var/run/calico \
  -v ucp-node-certs:/ucp-node-certs:ro \
  mirantis/ucp-dsinfo:<mkeVersion> \
  calicoctl --allow-version-mismatch \
"
In the above command, replace the following values with the corresponding settings of the affected cluster:
<etcdEndpoint> is the etcd endpoint defined in the Calico configuration file. For example, ETCD_ENDPOINTS=127.0.0.1:12378
<mkeVersion> is the MKE version installed on your cluster. For example, mirantis/ucp-dsinfo:3.5.7.
Verify the node list on the cluster:
kubectl get node
Compare this list with the node list in Calico to identify the old node:
calicoctl get node -o wide
Remove the old node from Calico:
calicoctl delete node kaas-node-<nodeID>
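After the deletion, you can re-run both listings to confirm that the Calico node list matches the Kubernetes node list:
kubectl get node
calicoctl get node -o wide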
[5782] Manager machine fails to be deployed during node replacement¶
During replacement of a manager machine, the following problems may occur:
The system adds the node to Docker swarm but not to Kubernetes
The node Deployment gets stuck with failed RethinkDB health checks
Workaround:
Delete the failed node.
Wait for the MKE cluster to become healthy. To monitor the cluster status:
Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.
Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
Deploy a new node.
[5568] The ‘calico-kube-controllers’ Pod fails to clean up resources¶
During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶
[34599] Ceph ‘ClusterWorkloadLock’ blocks upgrade from 2.23.5 to 2.24.1¶
On management clusters based on Ubuntu 18.04, after the cluster starts upgrading from 2.23.5 to 2.24.1, all controller machines are stuck in the In Progress state with the Distribution update in progress hover message displaying in the Container Cloud web UI.
The issue is caused by the clusterworkloadlock object containing an outdated release name in the status.release field, which blocks the LCM Controller from proceeding with the machine upgrade. This behavior is caused by the complete removal of the ceph-controller chart from management clusters and a failed ceph-clusterworkloadlock removal.
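To confirm that the cluster is affected, you can inspect the remaining lock object and its status.release field, for example:
kubectl get clusterworkloadlock ceph-clusterworkloadlock -o yaml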
The workaround is to manually remove ceph-clusterworkloadlock from the management cluster to unblock the upgrade:
kubectl delete clusterworkloadlock ceph-clusterworkloadlock
[26441] Cluster update fails with the ‘MountDevice failed for volume’ warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with the PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.
Workaround:
Verify that the description of the Pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas.
On every csi-rbdplugin Pod, search for the stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete the volumeattachment of the affected Pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
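To confirm the recovery, you can verify that the PersistentVolumeClaim is bound and the affected Pod is running again, for example:
kubectl -n <affectedProjectName> get pvc
kubectl -n <affectedProjectName> get pod <affectedPodName>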
Update¶
[33438] ‘CalicoDataplaneFailuresHigh’ alert is firing during cluster update¶
During cluster update of a managed bare metal cluster, the false positive CalicoDataplaneFailuresHigh alert may fire. Disregard this alert; it will disappear once the cluster update succeeds.
The observed behavior is typical for calico-node during upgrades, as workload changes occur frequently. Consequently, there is a possibility of temporary desynchronization in the Calico dataplane. This can occasionally result in throttling when applying workload changes to the Calico dataplane.