When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role receives an access denied error.
For example:
After node maintenance of a management cluster, newly added nodes may
fail to be provisioned successfully. The issue affects new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
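The check can be sketched as follows; the kaas namespace is an assumption, adjust it to your deployment:

```shell
# List the dnsmasq and dhcp-relay pods with their node placement.
# The "kaas" namespace is an assumption; adjust it to your deployment.
kubectl -n kaas get pods -o wide | grep -E 'dnsmasq|dhcp-relay'
```

If both pods are scheduled on the same node, the cluster may be affected.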
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
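For example, the cordon and drain steps typically look as follows; the drain flags shown are commonly required for nodes running DaemonSets and emptyDir volumes and are an assumption for your environment:

```shell
# Cordon the node to prevent new Pods from being scheduled on it,
# then drain it to evict the running Pods before deleting the machine.
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data
```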
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in to the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
Due to a known upstream MariaDB issue,
Pods may get stuck in continuous restarts during MariaDB operations
on a management cluster, with the following example error:
1. Create a backup of the /var/lib/mysql directory on the
mariadb-server Pod.
2. Verify that other replicas are up and ready.
3. Remove the galera.cache file for the affected mariadb-server Pod.
4. Remove the affected mariadb-server Pod or wait until it is automatically
restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes
and restores the quorum.
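The steps above can be sketched as follows; the namespace, Pod names, and the label selector are placeholders and assumptions, adjust them to your environment:

```shell
# All names in angle brackets are placeholders; the label selector
# "application=mariadb" is an assumption for this sketch.

# 1. Back up /var/lib/mysql from the affected mariadb-server Pod.
kubectl -n <namespace> cp <mariadb-server-pod>:/var/lib/mysql ./mysql-backup

# 2. Verify that the other replicas are up and ready.
kubectl -n <namespace> get pods -l application=mariadb

# 3. Remove the galera.cache file for the affected Pod.
kubectl -n <namespace> exec <mariadb-server-pod> -- rm /var/lib/mysql/galera.cache

# 4. Remove the affected Pod; after Kubernetes restarts it, the Pod
#    clones the database and restores the quorum in 1-2 minutes.
kubectl -n <namespace> delete pod <mariadb-server-pod>
```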
[30294] Replacement of a master node is stuck on the calico-node Pod start¶
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
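The alias might look as follows; the exact docker run options and the image tag are assumptions, refer to the MKE documentation for the exact invocation:

```shell
# Sketch only: the docker run options and the <mkeVersion> tag
# are assumptions; verify against the MKE documentation.
alias calicoctl="docker run -i --rm --net host --pid host \
  mirantis/ucp-dsinfo:<mkeVersion> calicoctl"
```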
[5568] The calico-kube-controllers Pod fails to clean up resources¶
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
[42908] The ceph-exporter pods are present in the Ceph crash list¶
After a managed cluster update, the ceph-exporter pods are present in the
ceph crash ls list because rook-ceph-exporter attempts to obtain
a port that is still in use. The issue does not block the managed cluster
update. Once the port becomes available, rook-ceph-exporter obtains the
port and the issue disappears.
As a workaround, run ceph crash archive-all to remove the
ceph-exporter crash entries from the Ceph crash list.
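For example, from a Pod or host with access to the Ceph CLI (the way you reach the Ceph toolbox is environment-specific):

```shell
# Inspect the crash list, then archive all entries.
ceph crash ls
ceph crash archive-all
```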
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains
FailedMount events:
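Assuming the standard kubectl describe command, the check may look as follows:

```shell
# Inspect the Pod description and filter for FailedMount events.
kubectl -n <affectedProjectName> describe pod <affectedPodName> | grep -A5 FailedMount
```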
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
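One possible way to find the csi-rbdplugin Pod on the affected node; the rook-ceph namespace and the app label are assumptions:

```shell
# The "rook-ceph" namespace and "app=csi-rbdplugin" label are assumptions;
# adjust them to your Ceph deployment.
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide | grep <nodeName>
```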
[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such a configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only to clusters on
which OpenSearch and Prometheus use the same storage pool.
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. If so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation":"the node is above the low watermark cluster setting \[cluster.routing.allocation.disk.watermark.low=85%], using more disk space \than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain a higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
Reserved_Percentage
A user-defined variable that specifies what percentage of the total storage
capacity must not be used by OpenSearch or Prometheus. It reserves space
for other components and is expressed as a decimal.
For example, for a 5% reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .Values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
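For example, to list the OpenSearch PVC capacities; the stacklight namespace is an assumption:

```shell
# The "stacklight" namespace is an assumption; adjust it to your deployment.
kubectl -n stacklight get pvc \
  -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage \
  | grep opensearch-master
```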
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculating the formula above provides the maximum safe storage to allocate
for .Values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use the
result as a reference when setting
.Values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
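Since the formula itself is not reproduced here, the following sketch shows one plausible reading of it based on the variable definitions above; the exact expression is an assumption and must be checked against the official calculation:

```shell
# Assumed relation (verify against the official documentation):
#   usable = Total_Storage_Capacity_GB * (1 - Reserved_Percentage - Filesystem_Reserve)
#            - Prometheus_PVC_Size_GB
Total_Storage_Capacity_GB=100
Prometheus_PVC_Size_GB=20
Reserved_Percentage=0.05
Filesystem_Reserve=0.05

usable=$(awk -v t="$Total_Storage_Capacity_GB" -v p="$Prometheus_PVC_Size_GB" \
             -v r="$Reserved_Percentage" -v f="$Filesystem_Reserve" \
             'BEGIN { printf "%.0f", t * (1 - r - f) - p }')
echo "persistentVolumeUsableStorageSizeGB: $usable"
```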
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
[43164] Rollover policy is not added to indices created without a policy¶
The initial index for the system* and audit* data streams can be
created without any policy attached due to a race condition.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices.
<class 'curator.exceptions.FailedExecution'>: Exception encountered.
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information.
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices.
<class 'curator.exceptions.FailedExecution'>: Exception encountered.
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information.
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
If the above-mentioned alert and errors are present, immediate action is
required, because they indicate that the corresponding index size has already
exceeded the space allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the second one is most likely affected
as well, although in rare cases only one index may be affected.
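A possible way to check whether a policy is attached to the initial write indices, assuming the OpenSearch ISM plugin explain API; the endpoint URL is a placeholder:

```shell
# Run from a host with access to the OpenSearch endpoint.
# A response without a policy_id indicates that no policy is attached.
curl -s "<opensearch-endpoint>/_plugins/_ism/explain/.ds-system-000001"
curl -s "<opensearch-endpoint>/_plugins/_ism/explain/.ds-audit-000001"
```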