Known issues

This section describes the MOSK known issues with available workarounds.

Bare metal

[56677] Failure to delete management cluster machine

When deleting a management cluster machine using the CLI or the management console, the process may fail with the following error message:

Cleaning failed: Node 6810c34c-f833-4606-884b-9d285c935ed8 failed step {'interface': 'deploy', 'step': 'erase_devices_metadata' ...
'msg': 'Cleanup cannot be performed without target_storage spec'

Workaround:

  1. Open the BareMetalHost object of the affected machine for editing (see the example command after this procedure).

  2. Change the value of the automatedCleaningMode field from metadata to disabled:

    spec:
      automatedCleaningMode: disabled
    
  3. Save the object and proceed with the deletion.

    Note

    A cleanup of related hardware disks will not be executed.
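
For reference, the BareMetalHost object of the affected machine can be opened for editing with a command similar to the following, where the namespace and object name are placeholders that depend on your deployment:

kubectl -n <bmhNamespace> edit baremetalhost <bmhObjectName>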

Ceph

[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment

During MOSK cluster deployment, the following false-positive alert for Ceph may be raised:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]

The issue occurs due to known upstream Ceph issues.

To verify whether the cluster is affected:

  1. Enter the ceph-tools pod. An example command is provided after this procedure.

  2. Verify the Ceph cluster status:

    • Verify Ceph health:

      ceph -s
      

      Example of a system response in an affected cluster:

      cluster:
        id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
        health: HEALTH_WARN
                2 OSD(s) experiencing slow operations in BlueStore
      
      services:
        mon: 3 daemons, quorum a,b,c (age 9h)
        mgr: a(active, since 9h), standbys: b
        osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
        rgw: 2 daemons active (2 hosts, 1 zones)
      
      data:
        pools:   15 pools, 409 pgs
        objects: 1.67k objects, 4.6 GiB
        usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
        pgs:     409 active+clean
      
      io:
        client:   85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr
      
    • Verify Ceph health details:

      ceph health detail
      

      Example of a system response in an affected cluster:

      HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
      [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
           osd.2 observed slow operation indications in BlueStore
           osd.3 observed slow operation indications in BlueStore
      
  3. Exit the ceph-tools pod.
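
For reference, entering and exiting the ceph-tools pod in steps 1 and 3 may look as follows, assuming the standard Rook toolbox deployment in the rook-ceph namespace; adjust the namespace and deployment name to your environment:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
# run ceph -s and ceph health detail, then
exit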

Workaround:

Open the CephDeployment object for editing:

kubectl -n ceph-lcm-mirantis edit cephdeployment

Configure the bluestore_slow_ops_warn options as follows:

spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"

Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.

This configuration triggers the alert only if at least 10 BlueStore slow operations occur within the last 10 minutes. If the alert is still triggered, it indicates a potential hardware disk issue on the BlueStore host that must be verified and addressed accordingly.

[58609] Ceph rebalancing gets stuck during disabled node removal

When disabling or removing a Ceph node during operations such as a rolling reboot, Ceph may not finish rebalancing if only two of three OSD nodes remain active. The CephDeployment object can remain in Maintenance, causing the rebalance process to wait indefinitely for Ceph to become ready. The issue may only affect environments with a small number of Ceph OSD nodes, the pool replica count set to one less than the number of storage nodes (replicas = storage_nodes_count - 1), and the failure domain set to host.

As a workaround, run the following command for the affected Ceph OSD node:

ceph osd reweight <osdId> 0
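
For example, assuming the affected node hosts the OSD with ID 3, run the following command from inside the ceph-tools pod. If the node hosts several OSDs, repeat the command for each OSD ID:

ceph osd reweight 3 0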

Cluster update

[8106] Frequent node disconnections with mcc-keepalived forcing new election

After cluster update, some nodes may remain in an unstable Ready state with mcc-keepalived constantly reelecting the leader and failing to acquire the VIP address, which produces forcing new election messages in the logs.

Workaround:

  1. Identify the leader node that owns the VIP:

    1. On any control plane node, run the following command:

      cat /etc/keepalived/keepalived.conf
      

      In the system response, capture the VIP used for the cluster.

    2. Using the VIP, identify the leader node:

      ip a | grep <VIP>
      

      If the VIP is not found, run the command on another control plane node until you find the leader.

  2. Connect to the non-leader control plane nodes and change the priority on these nodes in keepalived.conf:

    vi /etc/keepalived/keepalived.conf
    

    For example, change the priority on each node to 150 and 200 respectively:

    vrrp_instance VRRP1 {
        state MASTER
        garp_master_delay 15
        interface k8s-lcm
        virtual_router_id 154
        priority 100        # Change it on one node to 150 and on the other node to 200
        virtual_ipaddress {
            10.205.88.181
        }
    }
    
  3. Restart the mcc-keepalived service on the control plane nodes where the priority was changed:

    systemctl restart mcc-keepalived
    
  4. After 10-15 minutes, verify the logs of the node identified in step 1:

    journalctl -u mcc-keepalived -f | grep election
    

    You should no longer see the forcing new election messages, and the flapping node status should be resolved.

[42449] Rolling reboot failure on a Tungsten Fabric cluster

During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
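
A hedged example of restarting the RabbitMQ pods, assuming the default tf namespace and the tf-rabbitmq name prefix; adjust both to your environment:

# List the RabbitMQ pods of the Tungsten Fabric cluster
kubectl -n tf get pods | grep rabbitmq
# Delete the pods one by one so that they are recreated
kubectl -n tf delete pod <rabbitmqPodName>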

Air-gapped cluster update

[59171] Validation for air-gapped cluster update fails due to content mismatch

Files listed in the air-gapped index may become unavailable at their upstream locations, causing the sync operation to fail with Size mismatch and Digest mismatch errors during the preparation of an offline copy.

As a workaround, use the pre-built pool data provided by Mirantis support.

The pre-built pool data is the deduplicated blob storage used by airgapped-tools. The sync operation verifies the local cache first. Therefore, if a file with the correct size and checksum exists in the cache, it is fetched from the local cache instead of being downloaded from upstream.

The pre-built pool data can vary in size, from a dozen megabytes to several gigabytes. Large pool archives are split into 1 GiB chunks. A sha256 checksum is calculated for each chunk as well as for the unsplit tarball. The following example illustrates the full list of files for the mcr-repo-files_mcc-2.31.0 tarball:

SIZE   FILENAME
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part00
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part00.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part01
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part01.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part02
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part02.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part03
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part03.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part04
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part04.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part05
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part05.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part06
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part06.sha256
1.0G   mcr-repo-files_mcc-2.31.0.tar.gz.part07
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part07.sha256
386M   mcr-repo-files_mcc-2.31.0.tar.gz.part08
108B   mcr-repo-files_mcc-2.31.0.tar.gz.part08.sha256
101B   mcr-repo-files_mcc-2.31.0.tar.gz.sha256
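
For context only, a chunked layout like the one above is typically produced with standard tools similar to the following sketch. You do not need to perform these steps; they only illustrate how the chunks and checksums relate to the unsplit tarball:

# Checksum of the unsplit tarball
sha256sum mcr-repo-files_mcc-2.31.0.tar.gz > mcr-repo-files_mcc-2.31.0.tar.gz.sha256
# Split the tarball into 1 GiB chunks with numeric .partNN suffixes
split -b 1G -d -a 2 mcr-repo-files_mcc-2.31.0.tar.gz mcr-repo-files_mcc-2.31.0.tar.gz.part
# Per-chunk checksums
for chunk in mcr-repo-files_mcc-2.31.0.tar.gz.part??; do sha256sum "${chunk}" > "${chunk}.sha256"; done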

To use pre-built pool data:

  1. Contact Mirantis support to obtain a download URL for the pre-built pool data.

  2. Download all chunks (*.tar.gz.partXX), their checksums (*.tar.gz.partXX.sha256), and the resulting unsplit tarball checksum (.tar.gz.sha256) files to the required directory. For example, to ${AIRGAPPED_WORKSPACE}/tmp.

    Note

    Run all commands below from the directory where you downloaded all chunks and checksum files.

  3. Verify that the sha256 checksums are valid for each downloaded chunk:

    find . -name '*.tar.gz.part??.sha256' -exec sha256sum -c {} \;
    

    Example of system response:

    ./mcr-repo-files_mcc-2.31.0.tar.gz.part00: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part02: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part06: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part04: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part01: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part03: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part07: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part05: OK
    ./mcr-repo-files_mcc-2.31.0.tar.gz.part08: OK
    
  4. Concatenate all chunks into a single tarball in numerical order (part00, part01, part02, …). For example:

    Note

    In the command below, replace mcr-repo-files_mcc-2.31.0 with the file name prefix of the tarball you downloaded. To obtain the correct tarball file name prefix:

    find . -maxdepth 1 -type f -name '*.tar.gz.sha256' -exec sh -c 'basename {} .sha256' \;
    
    cat *.part?? > mcr-repo-files_mcc-2.31.0.tar.gz
    
  5. Verify that the concatenated tarball has a valid checksum. For example:

    sha256sum -c mcr-repo-files_mcc-2.31.0.tar.gz.sha256
    

    Example of a positive system response:

    mcr-repo-files_mcc-2.31.0.tar.gz: OK
    
  6. Unpack the tarball into the files blob directory under ${AIRGAPPED_WORKSPACE}, the workspace directory previously set using the AIRGAPPED_WORKSPACE environment variable:

    find . -name '*.tar.gz' -exec tar -xzf {} -C "${AIRGAPPED_WORKSPACE}/files" \;
    
  7. Proceed with the sync, validation, and export steps 3-5 described in Fetch the release_files tarball and sync data.

LCM

[42889] Graceful reboot gets stuck when Kubernetes and OpenStack control planes are drained simultaneously

When a GracefulRebootRequest targets both the Kubernetes and OpenStack control plane machines, either by listing machines of both types in spec.machines or by leaving the list empty to reboot all cluster nodes, the rolling reboot may get stuck. This happens because both node groups are drained in parallel, and the OpenStack workload manager running on the Kubernetes control plane becomes unavailable while the OpenStack control plane nodes are simultaneously being drained.

Workaround:

  1. Identify the machines that have not yet been rebooted.

  2. Delete the stuck GracefulRebootRequest:

    kubectl -n <projectName> delete gracefulrebootrequest <gracefulRebootRequestName>
    
  3. Recreate the reboot requests in two sequential steps as described in Perform a rolling reboot of a cluster using CLI: first for the Kubernetes control plane machines, then, once that request completes and is deleted, for the remaining machines that still require a reboot.
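
A minimal sketch of a GracefulRebootRequest limited to the Kubernetes control plane machines. Only spec.machines is taken from the description above; the API version and all names are placeholders that you must verify against your environment and product documentation:

apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: <clusterName>
  namespace: <projectName>
spec:
  machines:
  - <controlPlaneMachineName1>
  - <controlPlaneMachineName2>
  - <controlPlaneMachineName3>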

MOSK management console

[50168] Inability to use a new project right after creation

A newly created project does not display all available tabs in the MOSK management console and shows various access denied errors during the first five minutes after creation.

To work around the issue, wait five minutes after the project creation and refresh the browser.

OpenSDN

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all nodes of the Cassandra TFConfig or TFAnalytics cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
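
For example, to re-initiate the bootstrap process for replica 1 of the config cluster:

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1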

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during initialization.

To work around the issue, restart the affected pods manually.

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node

After replacing a failed Tungsten Fabric controller node as described in Replace a failed OpenSDN controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
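
The Cassandra pod on the replaced node is deleted using the same pattern as in [13755]:

kubectl -n tf delete pod tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM>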

[51101] tf-config pods fail to process API calls

The OpenSDN tf-config pods may fail to process API calls when the uWSGI listen queue is full. As a result, the pods are reported as Unhealthy and OpenSDN deployments can fail. The pod logs contain repeated messages such as the following:

*** uWSGI listen queue of socket "10.10.0.155:8082" (fd: 3) full !!!
(101/100) ***

Workaround:

Delete all tf-config pods one by one so they are recreated.

  1. List the tf-config pods:

    kubectl get pods -l tungstenfabric=config -n tf
    
  2. Delete one tf-config pod:

    kubectl delete pod <POD_NAME> -n tf
    

    Wait for the new pod to be created.

  3. Verify that the new pod has status Running and the restart count does not increase:

    kubectl get pods -l tungstenfabric=config -n tf
    

    Example of a positive system response:

    tf-config-jcfrr     4/4     Running     0     2m
    
  4. Repeat steps 2-3 for the remaining tf-config pods one by one.

OpenStack

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod (see the example commands after this procedure).

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
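
Hedged example commands for steps 1, 3, and 4; replace the namespace, Pod, and container names with the ones used in your cluster:

# Step 1: back up /var/lib/mysql from the affected Pod
kubectl -n <namespace> cp <mariadbServerPod>:/var/lib/mysql ./mariadb-mysql-backup -c <mariadbContainer>
# Step 3: remove the galera.cache file
kubectl -n <namespace> exec <mariadbServerPod> -c <mariadbContainer> -- rm -f /var/lib/mysql/galera.cache
# Step 4: delete the affected Pod so that Kubernetes restarts it
kubectl -n <namespace> delete pod <mariadbServerPod>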

[53401] Credential rotation reports success without performing action

Occasionally, the password rotation procedure for admin or service credentials may incorrectly report success without actually initiating the rotation process. This can result in unchanged credentials despite the procedure indicating completion.

To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.

[54570] The rfs-openstack-redis pod gets stuck in the Completed state

After node reboot, the rfs-openstack-redis pod may get stuck in the Completed state blocking synchronization of the Redis cluster.

As a workaround, delete the rfs-openstack-redis pod that remains in the Completed state:

kubectl -n openstack-redis delete pod <pod-name>
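
For example, to identify the stuck pod first:

kubectl -n openstack-redis get pods | grep Completed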

[57473] OpenStack update fails due to neutron-ovs-agent-default start failure

During the OpenStack update from Caracal to Epoxy, neutron-ovs-agent-default may fail to start, with the Rockoon logs reporting the The DaemonSet neutron-ovs-agent-default is not ready error. The issue is caused by Kopf missing the OsDpl update events.

As a workaround, recreate the rockoon pod of the affected MOSK cluster:

kubectl -n osh-system rollout restart deployment rockoon

Security

[58728] The managed: false field is added for auditd after cluster update

After the update of a management cluster to 2.31.0, the managed: false field is added to the auditd configuration in the Cluster object of MOSK clusters that have auditd enabled. This behavior is expected and does not affect the auditd functionality. Therefore, no action is required before the MOSK cluster update to 26.1.

For release changes in the auditd configuration and actions required after the MOSK cluster update to 26.1, see Migration of the auditd settings from the Cluster object to the auditd module.

StackLight

[48581] OpenSearchClusterStatusCritical is firing during cluster update

During update of a management or MOSK cluster with StackLight enabled in HA mode, the OpenSearchClusterStatusCritical alert may trigger when the next OpenSearch node restarts before the shards from the previous node finish reassigning. This can temporarily push some indices to the red state, making them unavailable for reads and writes and possibly causing some new logs to be lost.

The issue does not affect the cluster during the update; no workaround is needed, and you can safely ignore the alert.

[55317] The Dropped sample for series errors in the Prometheus logs

When the experimental feature memory-snapshot-on-shutdown, which is enabled by default, is used together with remote_write, Prometheus may emit multiple log messages, such as Dropped sample for series that was not explicitly dropped via relabelling. For more details, see the upstream issue description in the Prometheus GitHub project.

Workaround:

  1. On the related management cluster, open the affected MOSK Cluster object for editing:

    kubectl edit cluster <affectedMOSKClusterName> -n <affectedMOSKClusterProjectName>
    
  2. Remove the memory-snapshot-on-shutdown feature from the prometheusServer:enabledFeatures list:

    spec:
      ...
      providerSpec:
        ...
        value:
          ...
          helmReleases:
            ...
            - name: stacklight
              values:
                ...
                prometheusServer:
                  enabledFeatures: []