Mirantis Container Cloud (MCC) becomes part of Mirantis OpenStack for Kubernetes (MOSK)!

Starting with MOSK 25.2, the MOSK documentation set will cover all product layers, including MOSK management (formerly MCC). This means everything you need will be in one place. The separate MCC documentation site will be retired, so please update your bookmarks for continued easy access to the latest content.

Known issues¶

This section lists MOSK known issues with workarounds for the MOSK release 24.3.3. For the known issues in the related Container Cloud release, refer to Mirantis Container Cloud: Release Notes.

Update known issues¶

[42449] Rolling reboot failure on a Tungsten Fabric cluster¶

During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.

[46671] Cluster update fails with the tf-config pods crashed¶

Fixed in MOSK 25.1 and 25.1.1

When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:

tf-config-cs8zr                            2/5     CrashLoopBackOff   676 (19s ago)   15h
tf-config-db-6zxgg                         1/1     Running            44 (25m ago)    15h
tf-config-db-7k5sz                         1/1     Running            43 (23m ago)    15h
tf-config-db-dlwdv                         1/1     Running            43 (25m ago)    15h
tf-config-nw4tr                            3/5     CrashLoopBackOff   665 (43s ago)   15h
tf-config-wzf6c                            1/5     CrashLoopBackOff   680 (10s ago)   15h
tf-control-c6bnn                           3/4     Running            41 (23m ago)    13h
tf-control-gsnnp                           3/4     Running            42 (23m ago)    13h
tf-control-sj6fd                           3/4     Running            41 (23m ago)    13h

To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:

Logs from the tf-config API container:

NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})

Logs from the tf-cassandra pods:

INFO  [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
WARN  [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready

To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods sequentially to establish the connection between them:

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2

Now, all other services in the Tungsten Fabric namespace should be in the Active state.
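The sequential restart can be sketched as a small shell function that waits for each recreated pod to become Ready before deleting the next one. This is only a sketch: the pod names are taken from the commands above, while the timeout, the retry count, and the assumption that each pod is recreated under the same name are illustrative and not confirmed by this document.

```shell
# Sketch: restart the tf-cassandra-config pods one at a time.
# Assumptions: three replicas named as in the commands above; each deleted
# pod is recreated under the same name; timeout and retry values are
# illustrative only.
restart_cassandra_config() {
  for i in 0 1 2; do
    pod="tf-cassandra-config-dc1-rack1-$i"
    kubectl -n tf delete pod "$pod" || return 1
    # The new pod may not exist yet right after deletion, so retry the wait.
    tries=0
    until kubectl -n tf wait --for=condition=Ready "pod/$pod" --timeout=300s; do
      tries=$((tries + 1))
      [ "$tries" -lt 5 ] || return 1
      sleep 5
    done
  done
}
```

Call `restart_cassandra_config` from a shell that has access to the affected cluster. The function stops at the first pod that does not become Ready.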

[49078] Migration to containerd is stuck due to orphaned Docker containers¶

Fixed in MOSK 25.1 and 25.1.1

During migration of the container runtime from Docker to containerd, some nodes may get stuck with the following error in the LCM logs:

Orphaned Docker containers found after migration. Unable to proceed, please
check the node manually: exit status 2

The cluster is affected if orphaned containers with the k8s_ prefix are present on the affected nodes:

docker ps -a --format '{{ .Names }}' | grep '^k8s_'

Workaround:

Inspect recent Ansible logs at /var/log/lcm/* and make sure that the only failed task during migration is Delete running pods. If so, proceed to the next step. Otherwise, contact Mirantis support for further information.
Stop and remove orphaned containers with the k8s_ prefix.

Note

This action has no impact on the cluster because the nodes are already cordoned and drained as part of the maintenance window.
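The second workaround step, removing the orphaned containers, could be scripted on the affected node roughly as follows. The function names are hypothetical. The sketch reuses the filter from the check above and relies on `docker rm -f`, which stops a running container before removing it.

```shell
# List containers whose names start with k8s_ (same filter as the check above).
list_orphaned() {
  docker ps -a --format '{{ .Names }}' | grep '^k8s_'
}

# Force-remove every listed container.
remove_orphaned() {
  # grep exits non-zero when nothing matches, which means nothing to clean up.
  names=$(list_orphaned) || { echo "no orphaned k8s_ containers found"; return 0; }
  for name in $names; do
    docker rm -f "$name"
  done
}
```

Run `list_orphaned` first and review its output before calling `remove_orphaned`.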

[49678] The Machine status is flapping after migration to containerd¶

Fixed in MOSK 25.1 and 25.1.1

On cluster machines where any HostOSConfiguration object is targeted and migration to containerd is applied, the machine status may be flapping (Configure → Ready → Configure → Ready) with the HostOSConfiguration-related Ansible tasks constantly restarting. This occurs due to the HostOSConfiguration object state items being constantly added and then removed from related LCMMachine objects.

To work around the issue, temporarily disable all HostOSConfiguration objects until the issue is resolved. The fix is targeted for Container Cloud 2.29.0 and takes effect after the management cluster is updated to the Cluster release 16.4.0.

To disable HostOSConfiguration objects:

In the machineSelector:matchLabels section of every HostOSConfiguration object, remove the corresponding label selectors for cluster machines.
Wait for each HostOSConfiguration object status to be updated and the machinesStates field to be absent:
```
kubectl -n <namespace> get hoc <hoc-name> -o jsonpath='{.status.machinesStates}'
```
The system response must be empty.
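Waiting for the status update can be automated with a polling loop around the command above. The following is a minimal sketch: the function name, the polling interval, and the default attempt count are illustrative assumptions.

```shell
# Poll until .status.machinesStates of a HostOSConfiguration object is empty.
# Arguments: namespace, object name, optional number of attempts (default 30).
wait_hoc_cleared() {
  ns=$1
  name=$2
  attempts=${3:-30}
  while [ "$attempts" -gt 0 ]; do
    states=$(kubectl -n "$ns" get hoc "$name" -o jsonpath='{.status.machinesStates}')
    if [ -z "$states" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 10
  done
  # The field is still populated after all attempts.
  return 1
}
```

For example, `wait_hoc_cleared <namespace> <hoc-name>` returns 0 as soon as the system response is empty.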

Once the issue is resolved in the target release, re-enable all objects using the same procedure.

OpenStack¶

[31186,34132] Pods get stuck during MariaDB operations¶

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.

[42386] A load balancer service does not obtain the external IP address¶

Due to an upstream MetalLB issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, both services have the external IP address assigned and are accessible. After the externalTrafficPolicy value is changed from Cluster to Local for both services, the service that was changed first is left with no external IP address assigned, whereas the service that was changed later has the external IP address assigned as expected.

To work around the issue, make a dummy change to the service object where external IP is <pending>:

Identify the service that is stuck:

kubectl get svc -A | grep pending

Example of system response:

stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP

Add an arbitrary label to the service that is stuck. For example:

kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

Example of system response:

service/iam-proxy-prometheus labeled

Verify that the external IP was allocated to the service:

kubectl get svc -n stacklight iam-proxy-prometheus

Example of system response:

NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
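A stricter filter than `grep pending` can match on the TYPE and EXTERNAL-IP columns only. The following sketch assumes the default column order of `kubectl get svc -A` with the header line present. The function names are hypothetical, and `reconcile=1` is the arbitrary label from the example above.

```shell
# Print "<namespace> <name>" for each LoadBalancer service whose EXTERNAL-IP
# column is <pending>. Reads `kubectl get svc -A` output (with header) on stdin.
pending_lb_services() {
  awk 'NR > 1 && $3 == "LoadBalancer" && $5 == "<pending>" { print $1, $2 }'
}

# Add the dummy label to every stuck service.
label_pending_lb_services() {
  kubectl get svc -A | pending_lb_services | while read -r ns name; do
    kubectl label svc -n "$ns" "$name" reconcile=1
  done
}
```

If a service already carries the `reconcile` label, choose a different label key so that the object is actually modified.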

[43058] [Antelope] Cronjob for MariaDB is not created¶

Fixed in MOSK 25.1

Sometimes, after changing the OpenStackDeployment custom resource, it does not transition to the APPLYING state as expected.

To work around the issue, restart the rockoon pod in the osh-system namespace.

[47695] Cinder database sync job fails during upgrade from Antelope to Caracal¶

Fixed in MOSK 24.3.6

Due to the issue in the OpenStack Cinder online data migration code, the migration does not process soft-deleted rows. As a result, in the presence of non-processed soft-deleted rows, the cinder-db-sync job may fail during upgrade from Antelope to Caracal with the error in the pod logs similar to the following:

2024-10-24 18:55:06.678 1 ERROR cinder pymysql.err.DataError: (1265, "Data truncated for column 'use_quota' at row 24")

The issue can occur if your MOSK cluster was initially deployed with OpenStack Wallaby or an earlier release, such as Victoria or Ussuri, and you have not been performing the standard periodic OpenStack database cleanup procedure. In this case, unprocessed soft-deleted database rows for volumes and snapshots created before the Xena release may still be present in the database, causing the database migration to fail.

To verify if your cluster is affected:

Run the following SQL queries against the OpenStack database:
```
SELECT COUNT(*) FROM cinder.volumes WHERE use_quota IS NULL;
SELECT COUNT(*) FROM cinder.snapshots WHERE use_quota IS NULL;
```
If both queries return a zero count, your cluster is not affected.

If either query returns a non-zero count, your cluster is affected.
Verify that all the affected rows are soft-deleted:
```
SELECT COUNT(*) FROM cinder.volumes WHERE use_quota IS NULL AND deleted=0;
SELECT COUNT(*) FROM cinder.snapshots WHERE use_quota IS NULL AND deleted=0;
```
If either query returns a non-zero count, stop and contact Mirantis support.

If both queries return zero count, proceed with the workaround.
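The decision logic of the two verification steps can be summarized in a small helper. It is a sketch only: the function name is hypothetical, and the two counts are assumed to be obtained by running the queries above and summing the results for volumes and snapshots.

```shell
# Arguments:
#   $1  total number of rows with use_quota IS NULL (volumes + snapshots)
#   $2  number of those rows that are not soft-deleted (deleted=0)
cinder_use_quota_verdict() {
  null_rows=$1
  null_live_rows=$2
  if [ "$null_rows" -eq 0 ]; then
    echo "not affected"
  elif [ "$null_live_rows" -ne 0 ]; then
    echo "affected: contact Mirantis support"
  else
    echo "affected: proceed with the workaround"
  fi
}
```

For example, `cinder_use_quota_verdict 24 0` reports that the workaround applies, because all rows with a NULL use_quota value are soft-deleted.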

Workaround:

Manually set the use_quota field to 1 in the soft-deleted rows where its value is NULL, using the following SQL queries:

UPDATE cinder.volumes SET use_quota=1 WHERE deleted=1 AND use_quota IS NULL;
UPDATE cinder.snapshots SET use_quota=1 WHERE deleted=1 AND use_quota IS NULL;

This action is generally harmless as it only modifies rows that are already soft-deleted, and would eventually be removed by the database cleanup.

If you have already encountered the issue and your OpenStack upgrade is stuck, perform the database modification as described above, then re-run the cinder-db-sync job. Once completed, the upgrade should continue as expected.

Tungsten Fabric¶

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶

A simultaneous start of the Cassandra pods, for example, after rebooting all nodes of the Cassandra TFConfig or TFAnalytics cluster or during maintenance, may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot use the affected Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>

[40032] tf-rabbitmq fails to start after rolling reboot¶

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable tracking_records_in_ets during the initialization process.

To work around the problem, restart the affected pods manually.
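A manual restart could look like the following sketch. It assumes that the RabbitMQ pods run in the tf namespace, as in the other Tungsten Fabric commands in this section, and that their names start with tf-rabbitmq. Both are assumptions, so verify them with `kubectl -n tf get pods` first. The function name is hypothetical.

```shell
# Delete every pod in the tf namespace whose name starts with tf-rabbitmq.
# Narrow the grep pattern if only some of the instances are affected.
restart_tf_rabbitmq() {
  kubectl -n tf get pods -o name | grep '^pod/tf-rabbitmq' | while read -r pod; do
    kubectl -n tf delete "$pod"
  done
}
```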

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶

After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
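The Host ID of each node in the Down state can be extracted from the nodetool status output instead of being copied by hand. The sketch below relies on the Host ID being the second-to-last column, as in the example response above. The function names are hypothetical. The commands are printed rather than executed so that they can be reviewed first.

```shell
# Print the Host ID of every node reported as DN (Down/Normal).
# Reads `nodetool status` output on stdin.
down_host_ids() {
  awk '$1 == "DN" { print $(NF - 1) }'
}

# Print one removenode command per Down node.
# Argument: name of the Cassandra pod in which to run nodetool.
print_removenode_commands() {
  pod=$1
  down_host_ids | while read -r host_id; do
    echo "kubectl -n tf exec -it $pod -c cassandra -- nodetool removenode $host_id"
  done
}
```

To use the helper, pipe the nodetool status output of a Cassandra pod into `print_removenode_commands <pod-name>`, then run the printed commands after confirming that each listed node is indeed the outdated one.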

StackLight¶

[42463] KubePodsCrashLooping is firing during cluster update¶

During a major or patch update of a MOSK cluster with StackLight enabled in non-HA mode, the KubePodsCrashLooping alert may be firing for the Grafana ReplicaSet.

Grafana relies on PostgreSQL for persistent data. In non-HA StackLight setup, PostgreSQL becomes temporarily unavailable during updates. If Grafana loses its database connection or fails to establish one during startup, Grafana fails with an error. This may cause the Grafana pod to enter the CrashLoopBackOff state. Such behavior is expected in non-HA StackLight setups. The Grafana pod will resume normal operation after PostgreSQL is restored.

To prevent the issue, deploy StackLight in HA mode.

[48581] OpenSearchClusterStatusCritical is firing during cluster update¶

During an update of a management or MOSK cluster with StackLight enabled in HA mode, the OpenSearchClusterStatusCritical alert may trigger when the next OpenSearch node restarts before shards from the previous node finish assigning. This can temporarily turn some indices red, making them unavailable for reads and writes and possibly causing some new logs to be lost.

The issue does not affect the cluster during the update. No workaround is needed, and you can safely ignore the alert.

[49340] Tag-based filtering does not work for output_kind: audit¶

Fixed in MOSK 25.1 and 25.1.1

Tag-based filtering of logs using the tag_include parameter does not work for the logging.externalOutputs feature when output_kind: audit is selected.

For example, if the user wants to send only logs from the sudo program and sets tag_include: sudo, none of the logs will be sent to an external destination.

To work around the issue, allow forwarding of all audit logs in addition to sudo, which include logs from sshd, systemd-logind, and su. Instead of tag_include: sudo, specify tag_include: '{sudo,systemd-audit}'.

Once the fix is applied in MOSK 25.1, filtering starts working automatically.

[51524] sf-notifier creates a large number of relogins to Salesforce¶

Fixed in MOSK 24.3.5

The incompatibility between the newly implemented session refresh in the upstream simple-salesforce and the MOSK implementation of session refresh in sf-notifier results in uncontrolled growth of new logins and a lack of session reuse. The issue applies to both MOSK and management clusters.

Workaround:

To work around the issue, change the sf-notifier image tag directly in the Deployment object. This change is not persistent because it is reverted or overridden by any of the following:

Container Cloud version update (for management clusters)
Cluster release version update (for MOSK cluster)
Any sf-notifier-related operation (for all clusters):
- Disable and enable
- Credentials change
- IDs change
- Any configuration change for resources, node selector, tolerations, and log level

Once applied, this workaround must be re-applied whenever one of the above operations is performed in the cluster.

Print the currently used image:

kubectl get deployment sf-notifier -n stacklight -o jsonpath="{.spec.template.spec.containers[0].image}"

Possible results:

mirantis.azurecr.io/stacklight/sf-notifier:v0.4-20250113023013

127.0.0.1:44301/stacklight/sf-notifier:v0.4-20250113023013

Compare the sf-notifier image tag with the list of affected tags. If the image is affected, it has to be replaced. Otherwise, your cluster is not affected.

Affected tags:

v0.4-20241021023015
v0.4-20241028023015
v0.4-20241118023015
v0.4-20241216023012
v0.4-20250113023013
v0.4-20250217023014
v0.4-20250317092322
v0.4-20250414023016
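The comparison against the list of affected tags can be scripted. In the following sketch, the function name is hypothetical and the tag list is copied from the list above. The tag is taken as the text after the last colon so that a registry address with a port is handled correctly.

```shell
# Return 0 if the tag of the given image reference is in the affected list.
is_affected_sf_notifier_image() {
  case "${1##*:}" in
    v0.4-20241021023015|v0.4-20241028023015|v0.4-20241118023015|v0.4-20241216023012)
      return 0 ;;
    v0.4-20250113023013|v0.4-20250217023014|v0.4-20250317092322|v0.4-20250414023016)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

For example, pass the output of the image query above to the function: `is_affected_sf_notifier_image "<image>" && echo affected`.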

In the resulting string, replace only the tag of the affected image with the desired v0.4-20240828023015 tag. Keep the registry the same as in the original Deployment object.

Resulting images from examples:
```
mirantis.azurecr.io/stacklight/sf-notifier:v0.4-20240828023015
```
or
```
127.0.0.1:44301/stacklight/sf-notifier:v0.4-20240828023015
```
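Replacing only the tag while keeping the registry can be done with shell parameter expansion. The sketch below uses a hypothetical function name and strips everything from the last colon onward, so a registry port such as `:44301` is preserved.

```shell
# Print the given image reference with its tag replaced by the known-good tag.
retag_sf_notifier_image() {
  printf '%s:%s\n' "${1%:*}" "v0.4-20240828023015"
}
```

The printed value is the image reference to use when updating the Deployment object.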

Update the sf-notifier Deployment with the new image:

kubectl set image deployment/sf-notifier sf-notifier=<new image> -n stacklight

For example:

kubectl set image deployment/sf-notifier sf-notifier=mirantis.azurecr.io/stacklight/sf-notifier:v0.4-20240828023015 -n stacklight

kubectl set image deployment/sf-notifier sf-notifier=127.0.0.1:44301/stacklight/sf-notifier:v0.4-20240828023015 -n stacklight

Wait until the pod with the updated image is created, and check the logs. Verify that there are no errors in the logs:
```
kubectl logs pod/<sf-notifier pod> -n stacklight
```
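The wait and the log check can be combined, for example, as follows. This is a sketch under assumptions: the function name and the timeout are illustrative, and matching the word "error" case-insensitively is only a rough heuristic for spotting problems in the logs.

```shell
# Wait for the sf-notifier rollout to finish, then scan the logs for errors.
verify_sf_notifier() {
  kubectl rollout status deployment/sf-notifier -n stacklight --timeout=300s || return 1
  # Print any log lines that mention an error and fail if there are any.
  if kubectl logs deployment/sf-notifier -n stacklight | grep -i 'error'; then
    echo "errors found in sf-notifier logs" >&2
    return 1
  fi
}
```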

As this change is not persistent and can be reverted by the cluster update operation or any operation related to sf-notifier, periodically check all clusters and if the change has been reverted, re-apply the workaround.

Optionally, you can add a custom alert that will monitor the current tag of the sf-notifier image and will fire the alert if the tag is present in the list of affected tags. For the custom alert configuration details, refer to Alert configuration.

Example of a custom alert to monitor the current tag of the sf-notifier image:

- name: stacklight
  values:
    ...
    prometheusServer:
      ...
      customAlerts:
        ...
      - alert: SFnotifierImageVersion
        annotations:
          description: "sf-notifier image has a buggy tag, please revert deployment image back to sf-notifier:v0.4-20240828023015. This should be fixed in MCC 2.29.3 / MOSK 24.3.5, remove the alert after this release upgrade."
          summary: "This image might be causing too many API logins and exceeding our monthly API budget, please act immediately"
        expr: >-
          avg(kube_pod_container_info{container="sf-notifier",image_spec=~".*v0.4-(20241021023015|20241028023015|20241118023015|20241216023012|20250113023013|20250217023014|20250317092322|20250414023016)"})
        for: 5m
        labels:
          service: alertmanager
          severity: critical

Container Cloud web UI¶

[53886] Moving a worker machine into maintenance fails¶

After moving a cluster into maintenance mode using the MOSK management console, an attempt to move a worker machine into maintenance mode fails with the Only undefined workers can be in maintainance mode at the same time error.

As a workaround, move worker machines into maintenance mode using CLI as described in Enable maintenance mode on a cluster and machine using CLI.

[50168] Inability to use a new project through the Container Cloud web UI¶

A newly created project does not display all available tabs and shows various access denied errors during the first five minutes after creation.

To work around the issue, refresh the browser five minutes after the project creation.

[50181] Failure to deploy a compact cluster using the Container Cloud web UI¶

A compact MOSK cluster fails to be deployed through the Container Cloud web UI because the web UI does not allow adding labels to the control plane machines or changing the dedicatedControlPlane: false setting.

To work around the issue, manually add the required labels using CLI. Once done, the cluster deployment resumes.