Mirantis Container Cloud (MCC) becomes part of Mirantis OpenStack for Kubernetes (MOSK)!

Starting with MOSK 25.2, the MOSK documentation set covers all product layers, including MOSK management (formerly MCC). This means everything you need is in one place. The separate MCC documentation site will be retired, so please update your bookmarks for continued easy access to the latest content.

Known issues

This section describes the MOSK known issues with available workarounds.

OpenStack

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
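
The steps above translate into commands similar to the following sketch. The namespace, Pod name, container name, and label selector are assumptions and may differ in your environment:

# Assumed placeholders: <namespace> is where mariadb-server runs, <affected-pod> is the stuck replica
# 1. Back up the /var/lib/mysql directory of the affected Pod
kubectl -n <namespace> exec <affected-pod> -c mariadb -- tar czf /tmp/mysql-backup.tar.gz /var/lib/mysql
kubectl -n <namespace> cp <affected-pod>:/tmp/mysql-backup.tar.gz ./mysql-backup.tar.gz -c mariadb

# 2. Verify that the other replicas are up and ready
kubectl -n <namespace> get pods -l app=mariadb

# 3. Remove the galera.cache file on the affected Pod
kubectl -n <namespace> exec <affected-pod> -c mariadb -- rm -f /var/lib/mysql/galera.cache

# 4. Delete the affected Pod or wait for its automatic restart
kubectl -n <namespace> delete pod <affected-pod>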

[42386] A load balancer service does not obtain the external IP address

Due to a MetalLB upstream issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, both services have the external IP address assigned and are accessible. After you change the externalTrafficPolicy value for both services from Cluster to Local, the service that was changed first loses its external IP address, while the service that was changed later keeps the external IP address assigned as expected.

To work around the issue, make a dummy change to the service object where external IP is <pending>:

  1. Identify the service that is stuck:

    kubectl get svc -A | grep pending
    

    Example of system response:

    stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP
    
  2. Add an arbitrary label to the service that is stuck. For example:

    kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
    

    Example of system response:

    service/iam-proxy-prometheus labeled
    
  3. Verify that the external IP was allocated to the service:

    kubectl get svc -n stacklight iam-proxy-prometheus
    

    Example of system response:

    NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
    iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
    

[51127] Replaced node creates a new OVN DB cluster instead of rejoining

After a node replacement, a new OVN database cluster is created instead of the replaced node rejoining the existing cluster, causing pod initialization issues.

If you encounter the issue, please contact Mirantis support for the workaround.

[53401] Credential rotation reports success without performing action

Occasionally, the password rotation procedure for admin or service credentials may incorrectly report success without actually initiating the rotation process. This can result in unchanged credentials despite the procedure indicating completion.

To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.

[54416] OpenStackDeployment is stuck with APPLYING after cluster update

After the cluster update completes, the OpenStackDeployment object gets stuck in the APPLYING state.

As a workaround, restart the rockoon pod.
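
For example, assuming the default MOSK layout where the Rockoon controller runs in the osh-system namespace (verify the namespace and pod name in your environment):

kubectl -n osh-system get pods | grep rockoon
kubectl -n osh-system delete pod <rockoon-pod-name>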

[54430] AMQP message delivery fails when message size exceeds RabbitMQ limit

When sending large messages, the following error might appear in logs:

oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on openstack-neutron-rabbitmq-rabbitmq-0.rabbitmq-neutron.openstack.svc.cluster.local:5672 after inf tries: Basic.publish: (406) PRECONDITION_FAILED - message size 40744975 is larger than configured max size 16777216

This error occurs because the message size (for example, 40744975 bytes) exceeds the configured RabbitMQ maximum message size (for example, 16777216 bytes), potentially disrupting the services that rely on AMQP messaging.

To work around the issue, increase the max_message_size setting in the OpenStackDeployment custom resource before updating to MOSK 25.2:

services:
  networking:
    rabbitmq:
      values:
        conf:
          rabbitmq:
            max_message_size: "67108864"
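
To verify the value that RabbitMQ currently applies, you can query the runtime environment of the affected RabbitMQ pod. The pod name and namespace below are taken from the error example above; the exact container layout may differ in your environment:

kubectl -n openstack exec openstack-neutron-rabbitmq-rabbitmq-0 -- rabbitmqctl environment | grep max_message_size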

[54570] The rfs-openstack-redis pod gets stuck in the Completed state

After a node reboot, the rfs-openstack-redis pod may get stuck in the Completed state, blocking synchronization of the Redis cluster.

As a workaround, delete the rfs-openstack-redis pod that remains in the Completed state:

kubectl -n openstack-redis delete pod <pod-name>
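
To identify the stuck pod, list the pods in the Succeeded phase, which kubectl displays as Completed:

kubectl -n openstack-redis get pods --field-selector=status.phase=Succeeded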

OpenSDN

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.

To work around the problem, restart the affected pods manually.
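
For example, assuming the Tungsten Fabric RabbitMQ pods run in the tf namespace, as in the other examples in this section:

kubectl -n tf get pods | grep rabbitmq
kubectl -n tf delete pod <tf-rabbitmq-pod-name>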

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node

After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>

Core

[54393] MOSK cluster status displays outdated Ceph status

On MOSK clusters with MiraCeph-based Ceph, after the Ceph cluster health is restored, the MOSK cluster status may keep displaying outdated information that was valid some time ago because baremetal-provider does not reconcile the MOSK cluster. As a workaround, restart the baremetal-provider pod or add any annotation or label to the Cluster object, for example:

kubectl annotate cluster <NAME> -n <NAMESPACE> foo=bar
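
Alternatively, to restart the baremetal-provider pod, delete it so that Kubernetes recreates it. The kaas namespace below is an assumption based on the default management cluster layout; verify it in your environment:

kubectl -n kaas get pods | grep baremetal-provider
kubectl -n kaas delete pod <baremetal-provider-pod-name>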

Ceph

[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment

During MOSK cluster deployment, the following false-positive example alert for Ceph may be raised:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]

The issue occurs due to known upstream Ceph issues.

To verify whether the cluster is affected:

  1. Enter the ceph-tools pod. For an example command, see the note after this procedure.

  2. Verify the Ceph cluster status:

    • Verify Ceph health:

      ceph -s
      

      Example of a positive system response in the affected cluster:

      cluster:
        id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
        health: HEALTH_WARN
                2 OSD(s) experiencing slow operations in BlueStore
      
      services:
        mon: 3 daemons, quorum a,b,c (age 9h)
        mgr: a(active, since 9h), standbys: b
        osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
        rgw: 2 daemons active (2 hosts, 1 zones)
      
      data:
        pools:   15 pools, 409 pgs
        objects: 1.67k objects, 4.6 GiB
        usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
        pgs:     409 active+clean
      
      io:
        client:   85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr
      
    • Verify Ceph health details:

      ceph health detail
      

      Example of a positive system response in the affected cluster:

      HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
      [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
           osd.2 observed slow operation indications in BlueStore
           osd.3 observed slow operation indications in BlueStore
      
  3. Exit the ceph-tools pod.
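
For step 1, a typical command to enter the ceph-tools pod, assuming the standard Rook toolbox deployment in the rook-ceph namespace, is the following:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash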

Workaround:

Configure the bluestore_slow_ops_warn options in the Ceph specification. Depending on your configuration, open the MiraCeph object or the KaaSCephCluster (kcc) object in the MOSK cluster project for editing:

kubectl -n ceph-lcm-mirantis edit miraceph
kubectl -n <moskClusterProject> edit kcc

Then, add the following parameters:

spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"

Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.

This configuration triggers the alert only if at least 10 BlueStore slow operations occur during the last 10 minutes. If the alert is triggered, it indicates a potential hardware disk issue on the affected BlueStore host, which you should verify and address accordingly.

StackLight

[48581] OpenSearchClusterStatusCritical is firing during cluster update

During an update of a management or MOSK cluster with StackLight enabled in HA mode, the OpenSearchClusterStatusCritical alert may trigger when the next OpenSearch node restarts before the shards from the previous node finish reassigning. This can temporarily push some indices to the red state, making them unavailable for reads and writes and possibly causing some new logs to be lost.

The issue does not affect cluster operability during the update; no workaround is needed, and you can safely ignore the alert.

Bare metal

[54431] False-positive InfraConnectivityMonitor status for machine readiness

On clusters with network infrastructure monitoring enabled, the status of the InfraConnectivityMonitor object may display the false-positive ok status for readiness of all cluster machines while some of them have not yet been processed by the related controller. As a result, the number of machines in the InfraConnectivityMonitor status does not match the actual number of cluster machines that are part of infrastructure connectivity monitoring.

As a workaround, wait until the number of machines in targetsConfigStatus of the InfraConnectivityMonitor object becomes equal to the number in inventoryConfigStatus; you can compare both sections as shown below.
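
For example, inspect the object status and compare both sections. The lowercase resource name, object name, and project namespace below are assumptions; adjust them for your environment:

kubectl -n <project> get infraconnectivitymonitor <name> -o yaml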

Container Cloud web UI

[50168] Inability to use a new project through the Container Cloud web UI

A newly created project does not display all available tabs and shows various access denied errors during the first five minutes after creation.

To work around the issue, refresh the browser page five minutes after the project creation.

Cluster update

[42449] Rolling reboot failure on a Tungsten Fabric cluster

During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.

[54416] OpenStackDeployment is stuck with APPLYING after cluster update

After the cluster update completes, the OpenStackDeployment object gets stuck in the APPLYING state.

As a workaround, restart the rockoon pod.