Known issues¶
This section lists MOSK known issues with workarounds for the MOSK release 25.2.1.
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
2. Verify that other replicas are up and ready.
3. Remove the galera.cache file for the affected mariadb-server Pod.
4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
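The steps above map to the following commands. This is a minimal sketch in which the namespace, Pod, and container names are assumptions; adjust them to the affected environment:

# Namespace, Pod, and container names below are assumptions
NS=kaas
POD=mariadb-server-2

# 1. Back up the /var/lib/mysql directory of the affected Pod (on the same volume)
kubectl -n "$NS" exec "$POD" -c mariadb -- cp -a /var/lib/mysql /var/lib/mysql-backup

# 2. Verify that the other replicas are up and ready
kubectl -n "$NS" get pods | grep mariadb-server

# 3. Remove the galera.cache file of the affected replica
kubectl -n "$NS" exec "$POD" -c mariadb -- rm /var/lib/mysql/galera.cache

# 4. Delete the affected Pod or wait until it is restarted automatically
kubectl -n "$NS" delete pod "$POD"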
[42386] A load balancer service does not obtain the external IP address¶
Due to an upstream MetalLB issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
Identify the service that is stuck:
kubectl get svc -A | grep pending
Example of system response:
stacklight iam-proxy-prometheus LoadBalancer 10.233.28.196 <pending> 443:30430/TCP
Add an arbitrary label to the service that is stuck. For example:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
Example of system response:
service/iam-proxy-prometheus labeled

Verify that the external IP was allocated to the service:
kubectl get svc -n stacklight iam-proxy-prometheus
Example of system response:
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
[53401] Credential rotation reports success without performing action¶
Occasionally, the password rotation procedure for admin or service
credentials may incorrectly report success without actually initiating
the rotation process. This can result in unchanged credentials despite
the procedure indicating completion.
To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.
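As a hedged verification example, assuming the rotated account is the OpenStack admin user and that the keystone-client deployment in the openstack namespace is available for running OpenStack CLI commands (both assumptions), a successful token request confirms that the new credentials are active:

# Issue a token with the rotated credentials; success indicates the rotation took effect
kubectl -n openstack exec -it deployment/keystone-client -- openstack token issue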
[54416] OpenStackDeployment is stuck with APPLYING after cluster update¶
After the cluster update is completed, the OpenStackDeployment object gets stuck
in the APPLYING state.
As a workaround, restart the rockoon pod.
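A minimal sketch of the restart, assuming the rockoon Pod runs in the osh-system namespace (the namespace and Pod naming are assumptions):

# Identify the rockoon Pod
kubectl -n osh-system get pods | grep rockoon

# Delete the Pod so that Kubernetes recreates it
kubectl -n osh-system delete pod <rockoon-pod-name>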
[54430] AMQP message delivery fails when message size exceeds RabbitMQ limit¶
When sending large messages, the following error might appear in logs:
oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on openstack-neutron-rabbitmq-rabbitmq-0.rabbitmq-neutron.openstack.svc.cluster.local:5672 after inf tries: Basic.publish: (406) PRECONDITION_FAILED - message size 40744975 is larger than configured max size 16777216
This error occurs because the message size (for example, 40744975 bytes) exceeds the configured RabbitMQ maximum message size (for example, 16777216 bytes), potentially causing disruption of the services that rely on AMQP messaging.
To work around the issue, increase the max_message_size setting
in the OpenStackDeployment custom resource before updating to MOSK
25.2:
services:
  networking:
    rabbitmq:
      values:
        conf:
          rabbitmq:
            max_message_size: "67108864"
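The snippet above goes into the OpenStackDeployment object. As a hedged example, assuming the object resides in the openstack namespace and that osdpl is accepted as the short name for OpenStackDeployment (both assumptions), it can be edited directly:

kubectl -n openstack edit osdpl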
[54570] The rfs-openstack-redis pod gets stuck in the Completed state¶
After node reboot, the rfs-openstack-redis pod may get stuck in the
Completed state blocking synchronization of the Redis cluster.
As a workaround, delete the rfs-openstack-redis pod that remains in the
Completed state:
kubectl -n openstack-redis delete pod <pod-name>
OpenSDN¶
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all Cassandra cluster TFConfig or TFAnalytics nodes, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may cause a broken Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or
analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics
cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable
the tracking_records_in_ets feature flag during the initialization process.
To work around the issue, restart the affected pods manually.
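A minimal sketch of the manual restart, assuming the affected Pods follow the tf-rabbitmq-<N> naming in the tf namespace (the Pod naming is an assumption):

# Identify the affected RabbitMQ Pods
kubectl -n tf get pods | grep rabbitmq

# Delete the affected Pods one by one so that Kubernetes recreates them
kubectl -n tf delete pod tf-rabbitmq-0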
[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶
After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.
To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status
Example of the system response with outdated IP addresses:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.201.144 509.43 KiB 256 ? 7e760a99-fae5-4921-b0c5-d9e6e1eca1c5 rack1
UN 192.168.50.146 534.18 KiB 256 ? 2248ea35-85d4-4887-820b-1fac4733021f rack1
UN 192.168.145.147 484.19 KiB 256 ? d988aaaa-44ae-4fec-a617-0b0a253e736d rack1
DN 192.168.145.144 481.53 KiB 256 ? c23703a1-6854-47a7-a4a2-af649d63af0c rack1
An extra node will appear in the cluster with an outdated IP address
(the IP of the terminated Cassandra pod) in the Down state.
To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
[47396] Exceeding the number of Cassandra tombstone records leads to unresponsive TF API (tf-config pods)¶
The issue occurs during HA operations such as hard node reboot and force node
shutdown, when tf-api is under load.
To work around the issue, manually trigger garbage collection and compaction as follows:
Reduce the garbage collection grace period:
ALTER TABLE config_db_uuid.obj_uuid_table WITH gc_grace_seconds = 10;
After some time (10+ seconds), run the following commands to force deletion of tombstones and compact the database:
nodetool garbagecollect -g CELL
nodetool compact -s
Restore the default gc_grace_seconds = 864000 to avoid potential performance issues.
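For example, the default can be restored with the same type of cqlsh statement as in the first step (derived from the values above):

ALTER TABLE config_db_uuid.obj_uuid_table WITH gc_grace_seconds = 864000;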
Ceph¶
[54195] Ceph OSD experiencing slow operations in BlueStore during MOSK deployment¶
During MOSK cluster deployment, a false-positive Ceph alert similar to the following example may be raised:
Failed to configure Ceph cluster: ceph cluster verification is failed:
[BLUESTORE_SLOW_OP_ALERT: 3 OSD(s) experiencing slow operations in BlueStore]
The issue occurs due to known upstream Ceph issues.
To verify whether the cluster is affected:
1. Enter the ceph-tools pod.
2. Verify the Ceph cluster status:

   1. Verify Ceph health:

      ceph -s

      Example of a system response in an affected cluster:

        cluster:
          id:     6ae41eb3-262e-4da9-8847-25efed2fcaa2
          health: HEALTH_WARN
                  2 OSD(s) experiencing slow operations in BlueStore

        services:
          mon: 3 daemons, quorum a,b,c (age 9h)
          mgr: a(active, since 9h), standbys: b
          osd: 4 osds: 4 up (since 9h), 4 in (since 9h)
          rgw: 2 daemons active (2 hosts, 1 zones)

        data:
          pools:   15 pools, 409 pgs
          objects: 1.67k objects, 4.6 GiB
          usage:   11 GiB used, 2.1 TiB / 2.1 TiB avail
          pgs:     409 active+clean

        io:
          client:   85 B/s rd, 500 KiB/s wr, 0 op/s rd, 27 op/s wr

   2. Verify Ceph health details:

      ceph health detail

      Example of a system response in an affected cluster:

        HEALTH_WARN 2 OSD(s) experiencing slow operations in BlueStore
        [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
            osd.2 observed slow operation indications in BlueStore
            osd.3 observed slow operation indications in BlueStore

3. Exit the ceph-tools pod.
Workaround:
Configure the bluestore_slow_ops_warn options as follows. Depending on the cluster, open the MiraCeph object (management cluster) or the KaaSCephCluster object (MOSK cluster) for editing:

kubectl -n ceph-lcm-mirantis edit miraceph

kubectl -n <moskClusterProject> edit kcc

Set the following parameters:

spec:
  cephClusterSpec:
    rookConfig:
      osd|bluestore_slow_ops_warn_lifetime: "600"
      osd|bluestore_slow_ops_warn_threshold: "10"
Wait for up to five minutes for the change to apply and the alert to disappear during cluster deployment.
This configuration triggers the alert only if at least 10 BlueStore slow operations occur during the last 10 minutes. If triggered, the alert indicates a potential hardware disk issue on the BlueStore host that must be verified and addressed accordingly.
StackLight¶
[48581] OpenSearchClusterStatusCritical is firing during cluster update¶
During update of a management or MOSK cluster with StackLight enabled in HA
mode, the OpenSearchClusterStatusCritical alert may trigger when the next
OpenSearch node restarts before shards from the previous node finish assigning.
This can temporarily push some indices to red, making them unavailable for
reads and writes and possibly causing loss of some new logs.
The issue does not affect the cluster during the update, no workaround is needed, and you can safely ignore it.
Bare metal¶
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
- All Pods are stuck in the Terminating state
- A new ironic Pod fails to start
- The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
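For example, draining a node that runs DaemonSet-managed Pods or Pods with emptyDir volumes typically requires additional kubectl flags; the flags below are common choices rather than a requirement of this procedure:

kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data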
[54431] False-positive InfraConnectivityMonitor status for machine readiness¶
On clusters with network infrastructure monitoring enabled, the status of the
InfraConnectivityMonitor object may display a false-positive ok status for
readiness of all cluster machines while some of them have not yet been processed
by the related controller. As a result, the number of machines in the
InfraConnectivityMonitor status does not match the actual number of cluster
machines that are part of infrastructure connectivity monitoring.
As a workaround, wait for some time until the number of machines in
targetsConfigStatus of the InfraConnectivityMonitor object becomes
equal to inventoryConfigStatus.
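A hedged way to compare the two counters, assuming the InfraConnectivityMonitor object can be queried by its kind name and resides in the cluster namespace on the management cluster (the resource name and namespace are assumptions):

# Inspect the object status and compare the number of machines listed
# in targetsConfigStatus with the number listed in inventoryConfigStatus
kubectl -n <moskClusterNamespace> get infraconnectivitymonitor -o yaml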
[54981] Management or MOSK cluster update is stuck due to the invalid BareMetalHostProfile spec¶
On clusters that were originally deployed using MOSK management releases
earlier than 2.26.0 (Cluster release 17.0.x or earlier), on which the
deprecated fields minSizeGiB, maxSizeGiB, or sizeGiB were not
migrated to minSize, maxSize, and size accordingly, the management
or MOSK cluster update may get stuck after the lcm-agent upgrade. For
details on the deprecated fields, see Bare metal deprecation notes.
Symptoms:
- LCM Agent is upgraded on the management cluster nodes to version 1.43.5.
- The update is stuck and nodes are not moving to the prepare and deploy phases.
- In the baremetal-provider logs, the following error message is present:

  {"level":"error","ts":"...","logger":"bm.manager","caller":"..." "msg":"Reconciler error","controller":"machine-controller","object":"default/master-0" "namespace":"default","name":"master-0","reconcileID":"..." "error":"failed to build AnsibleExtra Spec for Machine 'default/master-0' from BareMetalHost 'default/master-0' HardwareDetails matching BareMetalHostProfile 'default/region-one-default': invalid BareMetalHostProfile Spec: zero partition size is allowed for the last partition only. dev:...}
Workaround:
Apply the following script on the management and MOSK clusters containing deprecated sizing fields to unblock cluster update:
1. Click the migrate_bmhp.py link below to download the script.

2. Export the kubeconfig environment variable for the management cluster:

   export KUBECONFIG=~/.kube/kubeconfig-mgmt

3. Create a backup of the BareMetalHostProfile objects:

   # For the management cluster BareMetalHostProfile objects:
   kubectl get bmhp -o yaml > bmhp-backup-mgmt.yml

   # For the MOSK cluster BareMetalHostProfile objects:
   kubectl -n <moskClusterNamespace> get bmhp -o yaml > bmhp-backup-mosk.yml

4. Re-add the deprecated values to the baremetalhostprofiles.metal3.io CRDs:

   1. Recover the deprecated fields:

      migrate_bmhp.py add-crd

   2. Verify whether the deprecated fields were recovered:

      # For the management cluster BareMetalHostProfile objects:
      migrate_bmhp.py test-migrate default

      # For the MOSK cluster BareMetalHostProfile objects:
      migrate_bmhp.py test-migrate <moskClusterNamespace>

      If the system response contains patches for recovery, proceed to the following substep. Otherwise, proceed to step 5.

   3. Create a backup of objects with deprecated fields:

      # For the management cluster:
      kubectl get bmhp -o yaml > bmhp-backup-mgmt-depr-fields.yml

      # For the MOSK cluster:
      kubectl -n <moskClusterNamespace> get bmhp -o yaml > bmhp-backup-mosk-depr-fields.yml

   4. Migrate the deprecated fields:

      # For the management cluster:
      migrate_bmhp.py migrate default

      # For the MOSK cluster:
      migrate_bmhp.py migrate <moskClusterNamespace>

   5. Delete the deprecated fields from the affected CRDs:

      migrate_bmhp.py del-crd

5. If the recovery described in the previous step did not work, verify whether the BareMetalHostProfile objects contain the kubectl.kubernetes.io/last-applied-configuration annotation. If so, attempt to recover the deprecated values using this annotation:

   1. Review the patches to be recovered:

      # For the management cluster:
      migrate_bmhp.py test-last default

      # For the MOSK cluster:
      migrate_bmhp.py test-last <moskClusterNamespace>

      If the system response is empty, contact Mirantis support for further instructions.

   2. Recover the deprecated values:

      # For the management cluster:
      migrate_bmhp.py recover-last default

      # For the MOSK cluster:
      migrate_bmhp.py recover-last <moskClusterNamespace>
If none of the steps above works, contact Mirantis support for further instructions.
LCM¶
[7947] Docker panic causes the Docker service to restart every 24 hours¶
On manager nodes of MOSK and MOSK management clusters, the Docker service fails with panic every 24 hours due to errors related to sending of telemetry statistics. All affected manager nodes have MCR 25.0.12.
Symptoms:
The Docker service, Docker containers, and MKE-related services are restarted every 24 hours on manager nodes after the initial Docker service start.
The following error is present in Docker logs causing restarts:
Oct 12 21:20:11 kaas-node-<id> dockerd[...]: created by github.com/docker/docker/internal/mirantis/telemetry.(*Telemetry).start in goroutine 1
Oct 12 21:20:11 kaas-node-<id> dockerd[...]: /root/build-deb/engine/internal/mirantis/telemetry/telemetry.go:99 +0xb1
Workaround:
Apply the following procedure on both MOSK and MOSK management clusters:
1. Log in to the MKE web UI as admin.
2. Navigate to Admin settings and select Usage in the left sidebar menu.
3. Use the toggle to disable hourly usage reporting and API and UI tracking. Save the settings.

   Once done, the Docker daemon configuration /etc/docker/daemon.json on manager nodes will be updated with the "features":{"telemetry":false} option, which disables sending of telemetry statistics (see the reference snippet after the warning below).

4. Apply the changes to the Docker daemon by restarting the Docker service on each manager node one by one:
systemctl restart docker
Warning
Do not restart the Docker daemon on all machines simultaneously. Ensure that all services in the output of docker service ls are healthy before proceeding to the next manager node.
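For reference, a minimal sketch of the /etc/docker/daemon.json fragment that results from the steps above; a real file on a manager node contains other existing options that must be preserved:

{
  "features": {
    "telemetry": false
  }
}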
MOSK management console¶
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the MOSK
management console and shows various access denied errors during the first
five minutes after creation.
To work around the issue, refresh the browser five minutes after the project creation.
Cluster update¶
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
[54416] OpenStackDeployment is stuck with APPLYING after cluster update¶
After the cluster update is completed, the OpenStackDeployment object gets stuck
in the APPLYING state.
As a workaround, restart the rockoon pod.