Mirantis Container Cloud (MCC) becomes part of Mirantis OpenStack for Kubernetes (MOSK)!

Starting with MOSK 25.2, the MOSK documentation set covers all product layers, including MOSK management (formerly MCC). This means everything you need is in one place. The separate MCC documentation site will be retired, so please update your bookmarks for continued easy access to the latest content.

Known issues

This section describes the MOSK known issues with available workarounds.

OpenStack

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
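
The steps above translate into commands similar to the following sketch. The namespace, Pod name, container name, and label selector are assumptions and may differ in your environment:

# Assumed placeholders: <namespace> is where mariadb-server runs, <affected-pod> is the stuck replica
# 1. Back up the /var/lib/mysql directory of the affected Pod
kubectl -n <namespace> exec <affected-pod> -c mariadb -- tar czf /tmp/mysql-backup.tar.gz /var/lib/mysql
kubectl -n <namespace> cp <affected-pod>:/tmp/mysql-backup.tar.gz ./mysql-backup.tar.gz -c mariadb

# 2. Verify that the other replicas are up and ready
kubectl -n <namespace> get pods -l app=mariadb

# 3. Remove the galera.cache file on the affected Pod
kubectl -n <namespace> exec <affected-pod> -c mariadb -- rm -f /var/lib/mysql/galera.cache

# 4. Delete the affected Pod or wait for its automatic restart
kubectl -n <namespace> delete pod <affected-pod>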

[42386] A load balancer service does not obtain the external IP address

Due to a MetalLB upstream issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, both services have the external IP address assigned and are accessible. After you change the externalTrafficPolicy value for both services from Cluster to Local, the service that was changed first loses its external IP address, while the service that was changed later keeps the external IP address assigned as expected.

To work around the issue, make a dummy change to the service object where external IP is <pending>:

  1. Identify the service that is stuck:

    kubectl get svc -A | grep pending
    

    Example of system response:

    stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP
    
  2. Add an arbitrary label to the service that is stuck. For example:

    kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
    

    Example of system response:

    service/iam-proxy-prometheus labeled
    
  3. Verify that the external IP was allocated to the service:

    kubectl get svc -n stacklight iam-proxy-prometheus
    

    Example of system response:

    NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
    iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
    

[51127] Replaced node creates a new OVN DB cluster instead of rejoining

After a node replacement, a new OVN database cluster is created instead of the replaced node rejoining the existing cluster, causing pod initialization issues.

If you encounter the issue, please contact Mirantis support for the workaround.

[53401] Credential rotation reports success without performing action

Occasionally, the password rotation procedure for admin or service credentials may incorrectly report success without actually initiating the rotation process. This can result in unchanged credentials despite the procedure indicating completion.

To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.

[54416] OpenStackDeployment is stuck with APPLYING after cluster update

After the cluster update completes, the OpenStackDeployment object gets stuck in the APPLYING state.

As a workaround, restart the rockoon pod.
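
For example, assuming the default MOSK layout where the Rockoon controller runs in the osh-system namespace (verify the namespace and pod name in your environment):

kubectl -n osh-system get pods | grep rockoon
kubectl -n osh-system delete pod <rockoon-pod-name>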

[54430] AMQP message delivery fails when message size exceeds RabbitMQ limit

When sending large messages, the following error might appear in logs:

oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on openstack-neutron-rabbitmq-rabbitmq-0.rabbitmq-neutron.openstack.svc.cluster.local:5672 after inf tries: Basic.publish: (406) PRECONDITION_FAILED - message size 40744975 is larger than configured max size 16777216

This error occurs because the message size (for example, 40744975 bytes) exceeds the configured RabbitMQ maximum message size (for example, 16777216 bytes), potentially disrupting the services that rely on AMQP messaging.

To work around the issue, increase the max_message_size setting in the OpenStackDeployment custom resource before updating to MOSK 25.2:

services:
  networking:
    rabbitmq:
      values:
        conf:
          rabbitmq:
            max_message_size: "67108864"
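
To verify the value that RabbitMQ currently applies, you can query the runtime environment of the affected RabbitMQ pod. The pod name and namespace below are taken from the error example above; the exact container layout may differ in your environment:

kubectl -n openstack exec openstack-neutron-rabbitmq-rabbitmq-0 -- rabbitmqctl environment | grep max_message_size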

[54570] The rfs-openstack-redis pod gets stuck in the Completed state

After a node reboot, the rfs-openstack-redis pod may get stuck in the Completed state, blocking synchronization of the Redis cluster.

As a workaround, delete the rfs-openstack-redis pod that remains in the Completed state:

kubectl -n openstack-redis delete pod <pod-name>
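
To identify the stuck pod, list the pods in the Succeeded phase, which kubectl displays as Completed:

kubectl -n openstack-redis get pods --field-selector=status.phase=Succeeded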

OpenSDN

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.

To work around the problem, restart the affected pods manually.
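
For example, assuming the Tungsten Fabric RabbitMQ pods run in the tf namespace, as in the other examples in this section:

kubectl -n tf get pods | grep rabbitmq
kubectl -n tf delete pod <tf-rabbitmq-pod-name>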

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node

After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>

Core

[54393] MOSK cluster status displays outdated Ceph status

On MOSK clusters with MiraCeph-based Ceph, after the Ceph cluster health is restored, the MOSK cluster status may keep displaying outdated information that was valid some time ago because baremetal-provider does not reconcile the MOSK cluster. As a workaround, restart the baremetal-provider pod or add any annotation or label to the Cluster object, for example:

kubectl annotate cluster <NAME> -n <NAMESPACE> foo=bar
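
Alternatively, to restart the baremetal-provider pod, delete it so that Kubernetes recreates it. The kaas namespace below is an assumption based on the default management cluster layout; verify it in your environment:

kubectl -n kaas get pods | grep baremetal-provider
kubectl -n kaas delete pod <baremetal-provider-pod-name>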

Ceph

[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment

During MOSK cluster deployment, the following false-positive example alert for Ceph may be raised:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]

The issue occurs due to known upstream Ceph issues.

To verify whether the cluster is affected:

  1. Enter the ceph-tools pod. For an example command, see the note after this procedure.

  2. Verify the Ceph cluster status:

    • Verify Ceph health:

      ceph -s
      

      Example of a positive system response in the affected cluster:

      cluster:
        id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
        health: HEALTH_WARN
                2 OSD(s) experiencing slow operations in BlueStore
      
      services:
        mon: 3 daemons, quorum a,b,c (age 9h)
        mgr: a(active, since 9h), standbys: b
        osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
        rgw: 2 daemons active (2 hosts, 1 zones)
      
      data:
        pools:   15 pools, 409 pgs
        objects: 1.67k objects, 4.6 GiB
        usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
        pgs:     409 active+clean
      
      io:
        client:   85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr
      
    • Verify Ceph health details:

      ceph health detail
      

      Example of a positive system response in the affected cluster:

      HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
      [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
           osd.2 observed slow operation indications in BlueStore
           osd.3 observed slow operation indications in BlueStore
      
  3. Exit the ceph-tools pod.
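
For step 1, a typical command to enter the ceph-tools pod, assuming the standard Rook toolbox deployment in the rook-ceph namespace, is the following:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash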

Workaround:

Configure the bluestore_slow_ops_warn options in the Ceph specification. Depending on your configuration, open the MiraCeph object or the KaaSCephCluster (kcc) object in the MOSK cluster project for editing:

kubectl -n ceph-lcm-mirantis edit miraceph
kubectl -n <moskClusterProject> edit kcc

Then, add the following parameters:

spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"

Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.

This configuration triggers the alert only if at least 10 BlueStore slow operations occur during the last 10 minutes. If the alert is triggered, it indicates a potential hardware disk issue on the affected BlueStore host, which you should verify and address accordingly.

StackLight

[48581] OpenSearchClusterStatusCritical is firing during cluster update

During an update of a management or MOSK cluster with StackLight enabled in HA mode, the OpenSearchClusterStatusCritical alert may trigger when the next OpenSearch node restarts before the shards from the previous node finish reassigning. This can temporarily push some indices to the red state, making them unavailable for reads and writes and possibly causing some new logs to be lost.

The issue does not affect cluster operability during the update; no workaround is needed, and you can safely ignore the alert.

Bare metal

[54431] False-positive InfraConnectivityMonitor status for machine readiness

On clusters with network infrastructure monitoring enabled, the status of the InfraConnectivityMonitor object may display the false-positive ok status for readiness of all cluster machines while some of them have not yet been processed by the related controller. As a result, the number of machines in the InfraConnectivityMonitor status does not match the actual number of cluster machines that are part of infrastructure connectivity monitoring.

As a workaround, wait until the number of machines in targetsConfigStatus of the InfraConnectivityMonitor object becomes equal to the number in inventoryConfigStatus; you can compare both sections as shown below.
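
For example, inspect the object status and compare both sections. The lowercase resource name, object name, and project namespace below are assumptions; adjust them for your environment:

kubectl -n <project> get infraconnectivitymonitor <name> -o yaml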

Container Cloud web UI

[50168] Inability to use a new project through the Container Cloud web UI

A newly created project does not display all available tabs and shows various access denied errors during the first five minutes after creation.

To work around the issue, refresh the browser page five minutes after the project creation.

Cluster update

[42449] Rolling reboot failure on a Tungsten Fabric cluster

During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.

[54416] OpenStackDeployment is stuck with APPLYING after cluster update

After the cluster update completes, the OpenStackDeployment object gets stuck in the APPLYING state.

As a workaround, restart the rockoon pod.