Known issues¶
This section describes the MOSK known issues with available workarounds.
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
2. Verify that other replicas are up and ready.
3. Remove the galera.cache file for the affected mariadb-server Pod.
4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
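For reference, the steps above can be performed with kubectl. The following sketch assumes the affected Pod is named mariadb-server-2 and runs in the namespace <namespace>; adjust both names to your environment:

# Back up the MySQL data directory from the affected Pod to the local machine
kubectl -n <namespace> cp mariadb-server-2:/var/lib/mysql ./mariadb-server-2-mysql-backup
# Verify that the other mariadb-server replicas are up and ready
kubectl -n <namespace> get pods | grep mariadb-server
# Remove the galera.cache file of the affected Pod
kubectl -n <namespace> exec mariadb-server-2 -- rm -f /var/lib/mysql/galera.cache
# Delete the affected Pod or wait until it is restarted automatically
kubectl -n <namespace> delete pod mariadb-server-2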
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that was changed remains with no external IP address assigned, while the second service, which was changed later, has the external IP assigned as expected.
To work around the issue, make a dummy change to the service object whose external IP is <pending>:
1. Identify the service that is stuck:

   kubectl get svc -A | grep pending

   Example of system response:

   stacklight   iam-proxy-prometheus   LoadBalancer   10.233.28.196   <pending>   443:30430/TCP

2. Add an arbitrary label to the service that is stuck. For example:

   kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

   Example of system response:

   service/iam-proxy-prometheus labeled

3. Verify that the external IP was allocated to the service:

   kubectl get svc -n stacklight iam-proxy-prometheus

   Example of system response:

   NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
   iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
[51127] Replaced node creates a new OVN DB cluster instead of rejoining¶
After a node replacement, a new OVN database cluster is created instead of the replaced node rejoining the existing cluster, causing pod initialization issues.
If you encounter the issue, please contact Mirantis support for the workaround.
[53401] Credential rotation reports success without performing action¶
Occasionally, the password rotation procedure for admin or service credentials may incorrectly report success without actually initiating the rotation process. This can result in unchanged credentials despite the procedure indicating completion.
To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.
[54416] OpenStackDeployment is stuck with APPLYING after cluster update¶
After the cluster update is completed, the OpenStackDeployment state is stuck with the APPLYING status.
As a workaround, restart the rockoon pod.
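For example, a sketch of restarting the rockoon pod, assuming it runs in the osh-system namespace (adjust to your environment):

# Find the rockoon pod and delete it so that Kubernetes recreates it
kubectl -n osh-system get pods | grep rockoon
kubectl -n osh-system delete pod <rockoon-pod-name>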
[54430] AMQP message delivery fails when message size exceeds RabbitMQ limit¶
When sending large messages, the following error might appear in logs:
oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on openstack-neutron-rabbitmq-rabbitmq-0.rabbitmq-neutron.openstack.svc.cluster.local:5672 after inf tries: Basic.publish: (406) PRECONDITION_FAILED - message size 40744975 is larger than configured max size 16777216
This error occurs because the message size (for example, 40744975 bytes) exceeds the configured RabbitMQ maximum message size (for example, 16777216 bytes), potentially causing disruption of the services relying on AMQP messaging.
To work around the issue, increase the max_message_size setting in the OpenStackDeployment custom resource before updating to MOSK 25.2:
services:
  networking:
    rabbitmq:
      values:
        conf:
          rabbitmq:
            max_message_size: "67108864"
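For example, assuming the OpenStackDeployment object resides in the openstack namespace, you can open it for editing and add the snippet above under its spec section (the object name is environment-specific):

kubectl -n openstack edit openstackdeployment <osdpl-name>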
[54570] The rfs-openstack-redis pod gets stuck in the Completed state¶
After a node reboot, the rfs-openstack-redis pod may get stuck in the Completed state, blocking synchronization of the Redis cluster.
As a workaround, delete the rfs-openstack-redis pod that remains in the Completed state:
kubectl -n openstack-redis delete pod <pod-name>
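To identify the name of the stuck pod, you can first list the pods in that namespace, for example:

kubectl -n openstack-redis get pods | grep Completed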
OpenSDN¶
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all Cassandra nodes of the TFConfig or TFAnalytics cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable tracking_records_in_ets during the initialization process.
To work around the problem, restart the affected pods manually.
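For example, a sketch of restarting the affected pods, assuming the Tungsten Fabric components run in the tf namespace:

kubectl -n tf get pods | grep rabbitmq
kubectl -n tf delete pod <tf-rabbitmq-pod-name>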
[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶
After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.
To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status
Example of the system response with outdated IP addresses:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.201.144 509.43 KiB 256 ? 7e760a99-fae5-4921-b0c5-d9e6e1eca1c5 rack1
UN 192.168.50.146 534.18 KiB 256 ? 2248ea35-85d4-4887-820b-1fac4733021f rack1
UN 192.168.145.147 484.19 KiB 256 ? d988aaaa-44ae-4fec-a617-0b0a253e736d rack1
DN 192.168.145.144 481.53 KiB 256 ? c23703a1-6854-47a7-a4a2-af649d63af0c rack1
An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.
To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
Core¶
[54393] MOSK cluster status displays outdated Ceph status¶
On MOSK clusters with MiraCeph-based Ceph, after the Ceph cluster health is restored, the MOSK cluster status may still display outdated information that was valid some time ago because baremetal-provider does not reconcile the MOSK cluster. As a workaround, restart the baremetal-provider pod or add any annotation or label to the Cluster object, for example:
kubectl annotate cluster <NAME> -n <NAMESPACE> foo=bar
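If you opt for restarting the pod instead, the following sketch assumes that baremetal-provider runs in the kaas namespace of the management cluster (adjust the namespace and pod name to your environment):

kubectl -n kaas get pods | grep baremetal-provider
kubectl -n kaas delete pod <baremetal-provider-pod-name>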
Ceph¶
[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment¶
During MOSK cluster deployment, a false-positive alert for Ceph may be raised with the following example message:
Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]
The issue occurs due to known upstream Ceph issues.
To verify whether the cluster is affected:

1. Enter the ceph-tools pod (see the example command after this procedure).

2. Verify the Ceph cluster status:

   Verify Ceph health:

   ceph -s

   Example of a positive system response in the affected cluster:

     cluster:
       id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
       health: HEALTH_WARN
               2 OSD(s) experiencing slow operations in BlueStore

     services:
       mon: 3 daemons, quorum a,b,c (age 9h)
       mgr: a(active, since 9h), standbys: b
       osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
       rgw: 2 daemons active (2 hosts, 1 zones)

     data:
       pools:   15 pools, 409 pgs
       objects: 1.67k objects, 4.6 GiB
       usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
       pgs:     409 active+clean

     io:
       client:   85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr

   Verify Ceph health details:

   ceph health detail

   Example of a positive system response in the affected cluster:

   HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
   [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
       osd.2 observed slow operation indications in BlueStore
       osd.3 observed slow operation indications in BlueStore

3. Exit the ceph-tools pod.
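A common way to enter the ceph-tools pod, assuming the Rook toolbox deployment is named rook-ceph-tools and runs in the rook-ceph namespace (adjust to your environment):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash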
Workaround:
Configure the bluestore_slow_ops_warn options as follows:

kubectl -n ceph-lcm-mirantis edit miraceph

kubectl -n <moskClusterProject> edit kcc

spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"
Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.
This configuration triggers the alert only if at least 10 BlueStore slow operations occur during the last 10 minutes. If the alert is triggered, it indicates a potential hardware disk issue on the BlueStore host that must be verified and addressed accordingly.
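Optionally, to confirm that the options have propagated to the Ceph configuration, you can check them from the ceph-tools pod, for example:

ceph config dump | grep bluestore_slow_ops_warn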
StackLight¶
[48581] OpenSearchClusterStatusCritical is firing during cluster update¶
During the update of a management or MOSK cluster with StackLight enabled in HA mode, the OpenSearchClusterStatusCritical alert may trigger when the next OpenSearch node restarts before shards from the previous node are fully reassigned. This can temporarily push some indices to the red status, making them unavailable for reads and writes and possibly causing some new logs to be lost.
The issue does not affect the cluster during the update, no workaround is needed, and you can safely ignore it.
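If you still want to confirm that the condition is transient, you can check the OpenSearch cluster health after the update finishes. The following sketch assumes an OpenSearch pod named opensearch-master-0 in the stacklight namespace; the pod name, port, and security settings depend on your StackLight configuration:

kubectl -n stacklight exec opensearch-master-0 -- curl -s http://localhost:9200/_cluster/health?pretty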
Bare metal¶
[54431] False-positive InfraConnectivityMonitor status for machine readiness¶
On clusters with network infrastructure monitoring enabled, the status of the InfraConnectivityMonitor object may display a false-positive ok status for readiness of all cluster machines while some of them have not yet been processed by the related controller. As a result, the number of machines in the InfraConnectivityMonitor status does not match the actual number of cluster machines that are part of infrastructure connectivity monitoring.
As a workaround, wait until the number of machines in targetsConfigStatus of the InfraConnectivityMonitor object becomes equal to that in inventoryConfigStatus.
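For example, a sketch of comparing the two counts, assuming the InfraConnectivityMonitor object resides in the cluster project namespace (the object and namespace names are environment-specific):

kubectl -n <project-namespace> get infraconnectivitymonitor <name> -o yaml

In the output, compare the number of machines listed under targetsConfigStatus with the number listed under inventoryConfigStatus.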
Container Cloud web UI¶
[50168] Inability to use a new project through the Container Cloud web UI¶
A newly created project does not display all available tabs and shows various access denied errors during the first five minutes after creation.
To work around the issue, refresh the browser page five minutes after the project creation.
Cluster update¶
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
[54416] OpenStackDeployment is stuck with APPLYING after cluster update¶
After the cluster update is completed, the OpenStackDeployment state is stuck with the APPLYING status.
As a workaround, restart the rockoon pod.