This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.27.2 including the Cluster releases 16.2.2,
16.1.7, and 17.1.7.
[47202] Inspection error on bare metal hosts after dnsmasq restart¶
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If you need to change or add DHCP subnets to bootstrap new nodes, wait
until the dnsmasq pod becomes ready after the change, then create the
BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
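For example, assuming the managed-ns namespace used later in this section,
the ERROR column of the metal3 BareMetalHost printer output shows the
error type for each host:
kubectl get bmh -n managed-ns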
Verify whether the dnsmasq pod was in the Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
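For example, assuming the dnsmasq pod runs in the kaas namespace with the
app=dnsmasq label (adjust both for your deployment), compare the inspection
start time from the BareMetalHost status with the dhcpd container state
transitions:
kubectl get bmh -n managed-ns test-worker-3 -o jsonpath='{.status.operationHistory.inspect}'
kubectl -n kaas get pod -l app=dnsmasq -o jsonpath='{.items[0].status.containerStatuses}'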
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
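Assuming the object definition was saved to a file beforehand (the file
name below is hypothetical):
kubectl create -f bmh-test-worker-3.yaml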
When trying to list the HostOSConfigurationModules and HostOSConfiguration
custom resources, a service user or a user with the global-admin or
operator role obtains the access denied error. For example:
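The following listing attempts fail with the access denied error instead of
returning the objects (the plural resource names are assumed here; kubectl
also matches kind names case-insensitively):
kubectl get hostosconfigurationmodules
kubectl get hostosconfigurations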
[42386] A load balancer service does not obtain the external IP address¶
Due to a known MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed is left with no external IP
address assigned. However, the second service, which was changed later, has
the external IP address assigned as expected.
To work around the issue, make a dummy change to the service object whose
external IP address is <pending>:
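A minimal sketch of such a dummy change, assuming an arbitrary label is
acceptable on the service (any no-op metadata edit retriggers
reconciliation):
kubectl -n <namespace> patch service <serviceName> --type merge -p '{"metadata":{"labels":{"dummy":"retrigger"}}}'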
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments where the management cluster nodes
are configured with a single L2 segment used for all network traffic (PXE
and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
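For example, assuming both pods live in the kaas namespace:
kubectl -n kaas get pods -o wide | grep -e dnsmasq -e dhcp-relay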
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
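A sketch of such an alias following the common MKE pattern, where
<etcdEndpoint> and <mkeVersion> are placeholders for your cluster values
and the mounted paths may differ per deployment:
alias calicoctl="\
docker run -i --rm \
--pid host \
--net host \
-e ETCD_ENDPOINTS=<etcdEndpoint> \
-v /var/run/calico:/var/run/calico \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /etc/docker/ssl:/etc/docker/ssl \
mirantis/ucp-dsinfo:<mkeVersion> \
calicoctl"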
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address that the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
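For example (drain options may need adjusting for your workloads):
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets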
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
a PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the descriptions of the Pods that failed to run contain the
FailedMount events:
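Using the standard kubectl describe syntax:
kubectl -n <affectedProjectName> describe pod <affectedPodName>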
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
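One way to search the logs, assuming the Rook defaults of the rook-ceph
namespace and the app=csi-rbdplugin label:
kubectl -n rook-ceph logs -l app=csi-rbdplugin -c csi-rbdplugin | grep 'is being used'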
Identify csiPodName of the corresponding csi-rbdplugin:
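For example, assuming the same Rook defaults, where <nodeName> is the node
identified in the previous step:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o jsonpath='{.items[?(@.spec.nodeName=="<nodeName>")].metadata.name}'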