Known issues¶
This section describes the MOSK known issues with available workarounds.
Bare metal¶
[56677] Failure to delete management cluster machine¶
When deleting a management cluster machine using CLI or the management console, the process may fail with the following error message:
Cleaning failed: Node 6810c34c-f833-4606-884b-9d285c935ed8 failed step {'interface': 'deploy', 'step': 'erase_devices_metadata' ...
'msg': 'Cleanup cannot be performed without target_storage spec'
Workaround:
1. Open the BareMetalHost object of the affected machine for editing.
2. Set automatedCleaningMode from metadata to disabled:

   spec:
     automatedCleaningMode: disabled

3. Save the object and proceed with the deletion.
Note
A cleanup of related hardware disks will not be executed.
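For example, a hedged way to open the object for editing; the namespace and object name are placeholders:

kubectl -n <namespace> edit baremetalhost <bareMetalHostName>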
Ceph¶
[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment¶
During MOSK cluster deployment, the following false-positive alert for Ceph may be raised:
Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]
The issue occurs due to known upstream Ceph issues.
To verify whether the cluster is affected:
1. Enter the ceph-tools pod.
2. Verify the Ceph cluster status:

   1. Verify Ceph health:

      ceph -s

      Example of a positive system response in the affected cluster:

        cluster:
          id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
          health: HEALTH_WARN
                  2 OSD(s) experiencing slow operations in BlueStore

        services:
          mon: 3 daemons, quorum a,b,c (age 9h)
          mgr: a(active, since 9h), standbys: b
          osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
          rgw: 2 daemons active (2 hosts, 1 zones)

        data:
          pools:   15 pools, 409 pgs
          objects: 1.67k objects, 4.6 GiB
          usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
          pgs:     409 active+clean

        io:
          client: 85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr

   2. Verify Ceph health details:

      ceph health detail

      Example of a positive system response in the affected cluster:

      HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
      [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
          osd.2 observed slow operation indications in BlueStore
          osd.3 observed slow operation indications in BlueStore

3. Exit the ceph-tools pod.
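To enter and exit the ceph-tools pod, a hedged example based on standard Rook tooling; the rook-ceph namespace and deployment name are assumptions that may differ in your cluster:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
# Run the ceph commands above inside the pod, then type exit to leave it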
Workaround:
Configure the bluestore_slow_ops_warn options as follows:
kubectl -n ceph-lcm-mirantis edit cephdeployment
spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"
Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.
This configuration triggers the alert only if at least 10 BlueStore slow operations occur within the last 10 minutes. If the alert still fires, it indicates a potential hardware disk issue on the BlueStore host that must be verified and addressed accordingly.
[58609] Ceph rebalancing gets stuck during disabled node removal¶
When disabling or removing a Ceph node during operations such as a rolling
reboot, Ceph may not finish rebalancing if only two of three OSD nodes remain
active. The CephDeployment object can remain in Maintenance, causing
the rebalance process to wait indefinitely for Ceph to become ready. The issue
may affect only environments with a small number of Ceph OSD nodes, the pool
replica count set to one less than the number of storage nodes
(replicas = storage_nodes_count - 1), and the failure domain set to host.
As a workaround, run the following command for the affected Ceph OSD node:
ceph osd reweight <osdId> 0
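For example, a minimal sketch of applying the workaround from the ceph-tools pod; the OSD ID is a placeholder that you take from your own output:

# List OSDs and their hosts to find the ID of the OSD on the disabled node
ceph osd tree

# Set the weight of the affected OSD to 0 so that rebalancing can complete
ceph osd reweight <osdId> 0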
Cluster update¶
[8106] Frequent node disconnections with mcc-keepalived forcing new election¶
After cluster update, some nodes may remain in an unstable Ready state
with mcc-keepalived constantly reelecting the leader and failing to
acquire the VIP address, which produces forcing new election messages in
logs.
Workaround:
1. Identify the leader node that owns the VIP:

   1. On any control plane node, run the following command:

      cat /etc/keepalived/keepalived.conf

      In the system response, capture the VIP used for the cluster.

   2. Using the VIP, identify the leader node:

      ip a | grep <VIP>

      If the VIP is not found, run the command on another control plane node until you find the leader.

2. Connect to the non-leader control plane nodes and change the priority on these nodes in keepalived.conf:

   vi /etc/keepalived/keepalived.conf

   For example, change the priority on each node to 150 and 200 respectively:

   vrrp_instance VRRP1 {
       state MASTER
       garp_master_delay 15
       interface k8s-lcm
       virtual_router_id 154
       priority 100  # Change it on one node to 150 and on the other node to 200
       virtual_ipaddress {
           10.205.88.181
       }
   }

3. Restart the mcc-keepalived service on the control plane nodes where the priority was changed:

   systemctl restart mcc-keepalived

4. In 10-15 minutes, verify the logs of the node identified in step 1:

   journalctl -u mcc-keepalived -f | grep election

   You should no longer see the forcing new election messages, and the flapping node status should be resolved.
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
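A hedged example of restarting the RabbitMQ pods; the tf namespace matches the other Tungsten Fabric commands in this section, and the pod name is a placeholder from your own cluster:

# List the RabbitMQ pods of the Tungsten Fabric cluster
kubectl -n tf get pods | grep rabbitmq

# Delete the pods one by one so that they are recreated
kubectl -n tf delete pod <tf-rabbitmq-pod-name>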
Air-gapped cluster update¶
[59171] Validation for air-gapped cluster update fails due to content mismatch¶
Files listed in the air-gapped index may become unavailable at their upstream
locations, causing the sync operation to fail during the preparation of an
offline copy with the Size mismatch and Digest mismatch errors.
As a workaround, use the pre-built pool data provided by Mirantis support.
The pre-built pool data is the deduplicated blob storage used by
airgapped-tools. The sync operation verifies the local cache first.
Therefore, if a file with the correct size and checksum exists in the cache, it
is fetched from the local cache instead of being downloaded from upstream.
The pre-built pool data can vary in size, from a dozen megabytes to several
gigabytes. Large pool archives are split into 1 GiB chunks. A sha256
checksum is calculated for each chunk as well as for the unsplit tarball.
The following example illustrates the full list of files for the
mcr-repo-files_mcc-2.31.0 tarball:
SIZE FILENAME
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part00
108B mcr-repo-files_mcc-2.31.0.tar.gz.part00.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part01
108B mcr-repo-files_mcc-2.31.0.tar.gz.part01.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part02
108B mcr-repo-files_mcc-2.31.0.tar.gz.part02.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part03
108B mcr-repo-files_mcc-2.31.0.tar.gz.part03.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part04
108B mcr-repo-files_mcc-2.31.0.tar.gz.part04.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part05
108B mcr-repo-files_mcc-2.31.0.tar.gz.part05.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part06
108B mcr-repo-files_mcc-2.31.0.tar.gz.part06.sha256
1.0G mcr-repo-files_mcc-2.31.0.tar.gz.part07
108B mcr-repo-files_mcc-2.31.0.tar.gz.part07.sha256
386M mcr-repo-files_mcc-2.31.0.tar.gz.part08
108B mcr-repo-files_mcc-2.31.0.tar.gz.part08.sha256
101B mcr-repo-files_mcc-2.31.0.tar.gz.sha256
To use pre-built pool data:
Contact Mirantis support to obtain a download URL for the pre-built pool data.
Download all chunks (*.tar.gz.partXX), their checksums (*.tar.gz.partXX.sha256), and the unsplit tarball checksum (*.tar.gz.sha256) file to the required directory, for example, to ${AIRGAPPED_WORKSPACE}/tmp.

Note
Run all commands below from the directory where you downloaded all chunks and checksum files.
Verify that the sha256 checksums are valid for each downloaded chunk:

find . -name '*.tar.gz.part??.sha256' -exec sha256sum -c {} \;
Example of system response:
./mcr-repo-files_mcc-2.31.0.tar.gz.part00: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part02: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part06: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part04: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part01: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part03: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part07: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part05: OK
./mcr-repo-files_mcc-2.31.0.tar.gz.part08: OK
Concatenate all chunks into a single tarball in numerical order (part00, part01, part02, …). For example:

Note

In the command below, replace mcr-repo-files_mcc-2.31.0 with the file name prefix of the tarball you downloaded. To obtain the correct tarball file name prefix:

find . -name '*.tar.gz.sha256' -type f -maxdepth 1 -exec sh -c 'basename {} .sha256' \;
cat *.part?? > mcr-repo-files_mcc-2.31.0.tar.gz
Verify that the concatenated tarball has a valid checksum. For example:
sha256sum -c mcr-repo-files_mcc-2.31.0.tar.gz.sha256
Example of a positive system response:
mcr-repo-files_mcc-2.31.0.tar.gz: OK

Unpack the tarball into the files blob directory under AIRGAPPED_WORKSPACE, which was previously set using the AIRGAPPED_WORKSPACE environment variable:

find . -name '*.tar.gz' -exec tar -xzf {} -C "${AIRGAPPED_WORKSPACE}/files" \;
Proceed with the sync, validation, and export steps 3-5 described in Fetch the release_files tarball and sync data.
LCM¶
[42889] Graceful reboot gets stuck when Kubernetes and OpenStack control planes are drained simultaneously¶
When a GracefulRebootRequest targets both the Kubernetes and OpenStack
control plane machines, either by listing machines of both types in
spec.machines or by leaving the list empty to reboot all cluster nodes,
the rolling reboot may get stuck. This happens because both node groups are
drained in parallel, and the OpenStack workload manager running on the
Kubernetes control plane becomes unavailable while the OpenStack control plane
nodes are simultaneously being drained.
Workaround:
Identify the machines that have not yet been rebooted.
Delete the stuck GracefulRebootRequest:

kubectl -n <projectName> delete gracefulrebootrequest <gracefulRebootRequestName>
Recreate the reboot requests in two sequential steps as described in Perform a rolling reboot of a cluster using CLI: first for the Kubernetes control plane machines, then, once that request completes and is deleted, for the remaining machines that still require a reboot.
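For illustration, a minimal sketch of a GracefulRebootRequest that targets only the Kubernetes control plane machines; the machine names are placeholders, and the apiVersion should be verified against the CRD available in your cluster:

apiVersion: lcm.mirantis.com/v1alpha1  # assumption, verify with kubectl api-resources
kind: GracefulRebootRequest
metadata:
  name: <gracefulRebootRequestName>
  namespace: <projectName>
spec:
  machines:
  # List only the Kubernetes control plane machines in the first request.
  # Once it completes and is deleted, create a second request for the
  # remaining machines that still require a reboot.
  - <controlPlaneMachineName1>
  - <controlPlaneMachineName2>
  - <controlPlaneMachineName3>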
MOSK management console¶
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the MOSK
management console and shows various access denied errors during the first
five minutes after creation.
To work around the issue, refresh the browser five minutes after the project creation.
OpenSDN¶
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all TFConfig or TFAnalytics Cassandra cluster nodes at once, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or
analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics
cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable
the tracking_records_in_ets feature during the initialization process.
To work around the issue, restart the affected pods manually.
[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶
After replacing a failed Tungsten Fabric controller node as described in Replace a failed OpenSDN controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.
To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status
Example of the system response with outdated IP addresses:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.201.144 509.43 KiB 256 ? 7e760a99-fae5-4921-b0c5-d9e6e1eca1c5 rack1
UN 192.168.50.146 534.18 KiB 256 ? 2248ea35-85d4-4887-820b-1fac4733021f rack1
UN 192.168.145.147 484.19 KiB 256 ? d988aaaa-44ae-4fec-a617-0b0a253e736d rack1
DN 192.168.145.144 481.53 KiB 256 ? c23703a1-6854-47a7-a4a2-af649d63af0c rack1
An extra node will appear in the cluster with an outdated IP address
(the IP of the terminated Cassandra pod) in the Down state.
To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
[51101] tf-config pods fail to process API calls¶
The OpenSDN tf-config pods may fail to process API calls when the uWSGI
listen queue is full. As a result, pods report Unhealthy and OpenSDN
deployments can fail. In the pod logs, repeated messages appear such as:
*** uWSGI listen queue of socket "10.10.0.155:8082" (fd: 3) full !!!
(101/100) ***
Workaround:
Delete all tf-config pods one by one so they are recreated.
1. List the tf-config pods:

   kubectl get pods -l tungstenfabric=config -n tf

2. Delete one tf-config pod:

   kubectl delete pod <POD_NAME> -n tf

3. Wait for the new pod to be created and verify that it has the Running status and that the restart count does not increase:

   kubectl get pods -l tungstenfabric=config -n tf

   Example of a positive system response:

   tf-config-jcfrr   4/4   Running   0   2m

4. Repeat steps 2-3 for the remaining tf-config pods one by one.
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
2. Verify that other replicas are up and ready.
3. Remove the galera.cache file for the affected mariadb-server Pod.
4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
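A hedged sketch of these steps using kubectl; the namespace, Pod name, and container layout are placeholders and assumptions that depend on your deployment:

# Back up the MariaDB data directory from the affected Pod to the local machine
kubectl -n <namespace> cp <mariadb-server-pod>:/var/lib/mysql ./mysql-backup

# Confirm that the other mariadb-server replicas are up and ready
kubectl -n <namespace> get pods | grep mariadb-server

# Remove the galera.cache file on the affected Pod
kubectl -n <namespace> exec <mariadb-server-pod> -- rm /var/lib/mysql/galera.cache

# Delete the affected Pod so that Kubernetes recreates it and it re-clones the database
kubectl -n <namespace> delete pod <mariadb-server-pod>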
[53401] Credential rotation reports success without performing action¶
Occasionally, the password rotation procedure for admin or service
credentials may incorrectly report success without actually initiating
the rotation process. This can result in unchanged credentials despite
the procedure indicating completion.
To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.
[54570] The rfs-openstack-redis pod gets stuck in the Completed state¶
After node reboot, the rfs-openstack-redis pod may get stuck in the
Completed state blocking synchronization of the Redis cluster.
As a workaround, delete the rfs-openstack-redis pod that remains in the
Completed state:
kubectl -n openstack-redis delete pod <pod-name>
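To confirm that the Pod was recreated, a short hedged check using the same namespace:

kubectl -n openstack-redis get pods
# The recreated rfs-openstack-redis pod is expected to reach the Running state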
[57473] OpenStack update fails due to neutron-ovs-agent-default start failure¶
During OpenStack update from Caracal to Epoxy, neutron-ovs-agent-default
may fail to start with the The DaemonSet neutron-ovs-agent-default is not
ready error in the Rockoon logs due to Kopf missing the OsDpl update
events.
As a workaround, recreate the rockoon pod of the affected MOSK cluster:
kubectl -n osh-system rollout restart deployment rockoon
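After the restart, a hedged way to verify the recovery; the osh-system namespace comes from the command above, while the openstack namespace for the DaemonSet is an assumption:

# Verify that the rockoon Deployment is rolled out again
kubectl -n osh-system rollout status deployment rockoon

# Verify that the neutron-ovs-agent-default DaemonSet becomes ready
kubectl -n openstack get daemonset neutron-ovs-agent-default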
Security¶
[58728] The managed: false field is added for auditd after cluster update¶
After update of a management cluster to 2.31.0, the managed: false field
is added to the auditd configuration in the Cluster object of MOSK
clusters that have auditd enabled. This behaviour is expected and does not
affect the auditd functionality. Therefore, no action is required before MOSK
cluster update to 26.1.
For release changes in the auditd configuration and actions required after the MOSK cluster update to 26.1, see Migration of the auditd settings from the Cluster object to the auditd module.
StackLight¶
[48581] OpenSearchClusterStatusCritical is firing during cluster update¶
During update of a management or MOSK cluster with StackLight enabled in HA
mode, the OpenSearchClusterStatusCritical alert may trigger when the next
OpenSearch node restarts before shards from the previous node finish assigning.
This can temporarily push some indices to red, making them unavailable for
reads and writes and possibly causing some new logs to be lost.
The issue does not affect the cluster during the update; no workaround is needed, and you can safely ignore the alert.
[55317] The Dropped sample for series errors in the Prometheus logs¶
When the experimental feature memory-snapshot-on-shutdown, which is enabled
by default, is used together with remote_write, Prometheus may emit
multiple log messages, such as Dropped sample for series that was not
explicitly dropped via relabelling. For more details, see the upstream issue
description in the Prometheus GitHub project.
Workaround:
1. On the related management cluster, open the affected MOSK Cluster object for editing:

   kubectl edit cluster <affectedMOSKClusterName> -n <affectedMOSKClusterProjectName>

2. Remove the memory-snapshot-on-shutdown feature from the prometheusServer:enabledFeatures list:

   spec:
     ...
     providerSpec:
       ...
       value:
         ...
         helmReleases:
         ...
         - name: stacklight
           values:
             ...
             prometheusServer:
               enabledFeatures: []
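To verify that the messages stopped, a hedged check of the Prometheus logs; the stacklight namespace and the pod name are assumptions that may differ in your cluster:

kubectl -n stacklight logs <prometheus-server-pod> --all-containers | grep "Dropped sample for series"
# No new occurrences are expected after the feature is removed and Prometheus restarts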