[46245] Lack of access permissions for HOC and HOCM objects
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role receives the access denied error.
For example:
After a managed cluster update, old versions of system packages, including
the kernel, may remain on the manager nodes. This issue occurs because the task
responsible for updating packages fails to run after updating Ubuntu mirrors.
As a workaround, manually run apt-get upgrade on every manager
node after the cluster update but before rebooting the node.
[41305] DHCP responses are lost between dnsmasq and dhcp-relay pods
After node maintenance of a management cluster, the newly added nodes may
fail to provision. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
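A minimal sketch of this check, assuming the Pod names start with dnsmasq and dhcp-relay and that the Pods run in the kaas namespace (both are assumptions, not confirmed by this section). The helper reads Pod name and node name pairs and reports any node that hosts both Pods:

```shell
# Reads "POD_NAME NODE_NAME" rows on stdin and prints every node that hosts
# both a dnsmasq Pod and a dhcp-relay Pod.
# Exit status 0 means that such a node exists, 1 means that none was found.
shared_dhcp_node() {
  awk '
    $1 ~ /^dnsmasq/    { dns[$2] = 1 }
    $1 ~ /^dhcp-relay/ { relay[$2] = 1 }
    END {
      found = 0
      for (n in dns) if (n in relay) { print n; found = 1 }
      exit (found ? 0 : 1)
    }'
}

# Hypothetical usage (the kaas namespace is an assumption):
# kubectl -n kaas get pods --no-headers \
#   -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | shared_dhcp_node
```

If the helper prints a node name, both Pods share that node.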
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
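The two commands can be wrapped in a small helper, shown here as a sketch only. The function name and the RUN switch are hypothetical. By default the helper prints the commands so that they can be reviewed first:

```shell
# Cordon and drain a node before deleting it.
# Prints the commands unless RUN=1 is set, in which case it executes them.
cordon_and_drain() {
  node="$1"
  [ -n "$node" ] || { echo "usage: cordon_and_drain <nodeName>" >&2; return 1; }
  for cmd in "kubectl cordon $node" "kubectl drain $node"; do
    if [ "${RUN:-0}" = "1" ]; then
      $cmd || return 1   # stop if cordon or drain fails
    else
      printf '%s\n' "$cmd"
    fi
  done
}

# Review first:   cordon_and_drain <nodeName>
# Then execute:   RUN=1 cordon_and_drain <nodeName>
```

Depending on the workloads on the node, kubectl drain may require additional flags such as --ignore-daemonsets. The text above lists only the plain commands, so the sketch does the same.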
Due to issues with managing physical NVMe devices, lcm-agent cannot collect
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent","msg":"get hardware details","host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b","error":"new storage info: get disk info \"nvme0c0n1\": invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
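When several devices appear in the logs, the symlink command can be derived from the reported name. The sketch below assumes that every reported device follows the nvme<X>c<Y>n<Z> pattern of the example and maps to nvme<X>n<Z>. The function name is hypothetical. It only prints the command, which should be reviewed before running it as root:

```shell
# Given a device name reported by lcm-agent (for example, nvme0c0n1),
# print the ln command that links it to the matching namespace device.
nvme_symlink_cmd() {
  reported="$1"
  # Drop the controller part: nvme<X>c<Y>n<Z> -> nvme<X>n<Z>
  target=$(printf '%s\n' "$reported" |
    sed -n 's/^nvme\([0-9][0-9]*\)c[0-9][0-9]*n\([0-9][0-9]*\)$/nvme\1n\2/p')
  [ -n "$target" ] || { echo "unexpected device name: $reported" >&2; return 1; }
  printf 'ln -sfn /dev/%s /dev/%s\n' "$target" "$reported"
}

# nvme_symlink_cmd nvme0c0n1
```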
[39437] Failure to replace a master node on a Container Cloud cluster
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations
Due to the upstream MariaDB issue,
during MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
Workaround:
Create a backup of the /var/lib/mysql directory on the
mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically
restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes
and restores the quorum.
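The steps above can be sketched as a list of kubectl commands. This is an illustration under assumptions: the namespace and Pod name are supplied by the operator, the backup is taken with kubectl cp, and galera.cache is assumed to reside in /var/lib/mysql. The helper prints the commands for review and does not run them:

```shell
# Print the recovery commands for an affected mariadb-server Pod.
# $1: namespace, $2: Pod name, $3: local backup directory (optional)
mariadb_recovery_plan() {
  ns="$1"; pod="$2"; backup_dir="${3:-./mysql-backup}"
  [ -n "$ns" ] && [ -n "$pod" ] || {
    echo "usage: mariadb_recovery_plan <namespace> <podName> [backupDir]" >&2
    return 1
  }
  cat <<EOF
kubectl cp $ns/$pod:/var/lib/mysql $backup_dir
kubectl -n $ns get pods
kubectl -n $ns exec $pod -- rm /var/lib/mysql/galera.cache
kubectl -n $ns delete pod $pod
EOF
}

# mariadb_recovery_plan <namespace> <podName>
```

Run the second command and confirm that the other replicas are ready before removing the cache file or the Pod.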
[30294] Replacement of a master node is stuck on the calico-node Pod start
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
[5568] The calico-kube-controllers Pod fails to clean up resources
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node using the
kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
[26441] Cluster update fails with the MountDevice failed for volume warning
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
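One way to find the Pod, sketched under assumptions: the rook-ceph namespace and the app=csi-rbdplugin label in the usage comment are not confirmed by this section. The helper itself only filters Pod name and node name pairs by node:

```shell
# Reads "POD_NAME NODE_NAME" rows on stdin and prints the Pods that run
# on the node passed as the first argument.
pods_on_node() {
  awk -v node="$1" '$2 == node { print $1 }'
}

# Hypothetical usage (namespace and label are assumptions):
# kubectl -n rook-ceph get pod -l app=csi-rbdplugin --no-headers \
#   -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | pods_on_node <nodeName>
```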
On large managed clusters, shard relocation may fail in the OpenSearch cluster
with the yellow or red status of the OpenSearch cluster.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be temporary and
related, for example, to a peak load or to ongoing shard initialization
as part of disaster recovery or after a node restart. In this case,
Mirantis recommends waiting until initialization of all shards is finished.
After that, verify the cluster state and whether throttling still exists.
Apply the workaround below only if throttling persists.
To verify that the initialization of shards is ongoing:
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process is finished, and only then consider
changing the limit.
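To list the affected indices from a saved shard listing, a small filter can help. This sketch assumes the default column order of the OpenSearch _cat/shards output, where the index name is the first column and the shard state is the fourth. The endpoint in the usage comment is a placeholder:

```shell
# Reads _cat/shards output on stdin and prints the unique names of indices
# that have at least one shard in the INITIALIZING state.
initializing_indices() {
  awk '$4 == "INITIALIZING" { print $1 }' | sort -u
}

# Hypothetical usage (replace the endpoint with the one used in your cluster):
# curl -s "http://<opensearch-endpoint>:9200/_cat/shards" | initializing_indices
```

An empty result suggests that no shards are initializing at the moment.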
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for clusters with the
clusterSize: large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
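A possible way to run this check, given as a sketch. The app=opensearch-master label in the usage comment is an assumption, and the helper name is hypothetical. The helper reads Pod name and CPU limit pairs and prints the Pods whose limit differs from the expected value:

```shell
# Reads "POD_NAME CPU_LIMIT" rows on stdin and prints every Pod whose limit
# differs from the expected value passed as the first argument.
# Exit status 0 means that all Pods match, 1 means that at least one differs.
pods_with_other_limit() {
  awk -v want="$1" '
    $2 != want { print $1; bad = 1 }
    END { exit (bad ? 1 : 0) }'
}

# Hypothetical usage (the label selector is an assumption):
# kubectl -n stacklight get pod -l app=opensearch-master \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="opensearch")].resources.limits.cpu}{"\n"}{end}' \
#   | pods_with_other_limit <expectedLimit>
```

The API may report the limit in a normalized form, for example 12 instead of 12000m, so pass the expected value exactly as the cluster displays it.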
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required because they indicate that the corresponding index size has already
exceeded the space allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the other one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
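These values can be pulled from a saved policy response with standard tools. The sketch below assumes compact single-line JSON like the example response, with no spaces after colons. The function names are hypothetical. A JSON-aware tool such as jq is more robust if it is available:

```shell
# Print the value of a numeric top-level field (for example, _seq_no)
# from a compact policy response read on stdin.
policy_field() {
  sed -n 's/.*"'"$1"'":\([0-9][0-9]*\).*/\1/p'
}

# Print the rollover threshold (min_size) from the same response.
policy_min_size() {
  sed -n 's/.*"min_size":"\([^"]*\)".*/\1/p'
}

# Usage, with the response saved to a hypothetical file policy.json:
# policy_field _seq_no < policy.json
# policy_field _primary_term < policy.json
# policy_min_size < policy.json
```

Record the three values for the system and audit policies separately so that they can be compared later.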
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided
above and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.