Known issues¶

This section lists the known issues in MOSK 24.1.7 and their workarounds.

OpenStack¶

[31186,34132] Pods get stuck during MariaDB operations¶

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
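
The workaround steps above could be sketched on the command line as follows. This is a minimal sketch, not an official procedure. The `mariadb_recovery_plan` helper is hypothetical and only prints the commands for review. The namespace and Pod name are arguments that you supply. The backup target name and the location of galera.cache directly under /var/lib/mysql are assumptions to verify in your environment.

```shell
# Hypothetical helper: prints the workaround commands for review.
# It does not execute anything; run the printed commands one by one.
# Usage: mariadb_recovery_plan <namespace> <affected-pod>
mariadb_recovery_plan() {
  ns="$1"
  pod="$2"
  cat <<EOF
# 1. Back up the data directory (kubectl cp needs tar in the container)
kubectl -n ${ns} cp ${pod}:/var/lib/mysql ./${pod}-mysql-backup
# 2. Check that the other replicas are up and ready before you continue
kubectl -n ${ns} get pods
# 3. Remove the Galera cache (assumed to reside in the data directory)
kubectl -n ${ns} exec ${pod} -- rm /var/lib/mysql/galera.cache
# 4. Delete the affected Pod so that Kubernetes recreates it
kubectl -n ${ns} delete pod ${pod}
EOF
}
```

Printing the commands instead of executing them keeps the readiness check in step 2 a manual decision: continue to step 3 only if the other replicas are up and ready.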

[36524] etcd enters a panic state after replacement of the controller node¶

Fixed in MOSK 24.2

After the controller node is provisioned, the etcd pod starts before Kubernetes networking is fully operational. As a result, the pod has trouble resolving DNS names and connecting to the other members, which eventually puts the etcd service into a panic state.

Workaround:

Delete the PVC related to the replaced controller node:
```
kubectl -n openstack delete pvc <PVC-NAME>
```
Delete pods related to the crashing etcd service on the replaced controller node:
```
kubectl -n openstack delete pods <ETCD-POD-NAME>
```

[42386] A load balancer service does not obtain the external IP address¶

Due to an upstream MetalLB issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, both services have the external IP address assigned and are accessible. After the externalTrafficPolicy value of both services is changed from Cluster to Local, the service that was changed first is left with no external IP address assigned, whereas the service that was changed later has the external IP address assigned as expected.

To work around the issue, make a dummy change to the service object where external IP is <pending>:

Identify the service that is stuck:

kubectl get svc -A | grep pending

Example of system response:

stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP

Add an arbitrary label to the service that is stuck. For example:

kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

Example of system response:

service/iam-proxy-prometheus labeled

Verify that the external IP was allocated to the service:

kubectl get svc -n stacklight iam-proxy-prometheus

Example of system response:

NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
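
A plain `grep pending` also matches any line that merely contains the word "pending". As a stricter alternative, the following sketch filters on the TYPE and EXTERNAL-IP columns. The `list_pending_lb` function name is hypothetical, and the filter assumes the default column layout of `kubectl get svc -A` (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, ...).

```shell
# Reads `kubectl get svc -A` output on stdin and prints
# "<namespace> <name>" for every LoadBalancer service whose
# EXTERNAL-IP column (field 5) is still <pending>.
list_pending_lb() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1, $2 }'
}
```

Possible usage: `kubectl get svc -A | list_pending_lb`. Each printed pair can then be passed to the `kubectl label svc -n <namespace> <name> reconcile=1` command described above.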

[43058] [Antelope] Cronjob for MariaDB is not created¶

Fixed in MOSK 25.1

Sometimes, after changing the OpenStackDeployment custom resource, it does not transition to the APPLYING state as expected.

To work around the issue, restart the rockoon pod in the osh-system namespace.

[44813] [Antelope] Traffic disruption observed on trunk ports¶

Fixed in MOSK 24.2.1 and MOSK 24.3

After upgrading to OpenStack Antelope, clusters with configured trunk ports experience traffic flow disruptions that block the cluster updates.

To work around the issue, pin the MOSK Networking service (OpenStack Neutron) container image by adding the following content to the OpenStackDeployment custom resource:

spec:
  services:
    networking:
      neutron:
        values:
          images:
            tags:
              neutron_openvswitch_agent: mirantis.azurecr.io/openstack/neutron:antelope-jammy-20240816113600

Caution

Remove the pinning after updating to MOSK 24.2.1 or a later patch or major release.
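
To check after an update whether the pin is still present, a small filter over the YAML of the OpenStackDeployment object may help. This is a sketch under assumptions: the `pinned_ovs_agent_image` function name is hypothetical, and the filter only looks for a YAML line whose key is neutron_openvswitch_agent, regardless of its nesting level.

```shell
# Reads OpenStackDeployment YAML on stdin.
# Prints the pinned image and returns 0 if a neutron_openvswitch_agent
# tag is set; prints nothing and returns 1 otherwise.
pinned_ovs_agent_image() {
  awk '$1 == "neutron_openvswitch_agent:" { print $2; found = 1 }
       END { exit !found }'
}
```

Possible usage, assuming that the object resides in the openstack namespace and that the osdpl short name is available in your cluster: `kubectl -n openstack get osdpl <OSDPL-NAME> -o yaml | pinned_ovs_agent_image`.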

[45879] [Antelope] Incorrect packet handling between instance and its gateway¶

Fixed in MOSK 24.2.1

After upgrading to OpenStack Antelope, virtual machines experience connectivity disruptions when sending data over virtual networks. Network packets with full MTU are dropped.

The issue affects the MOSK clusters with Open vSwitch as the networking backend and with the following specific MTU settings:

The MTU configured on the tunnel interface of compute nodes is equal to the value of the spec:services:networking:neutron:values:conf:neutron:DEFAULT:global_physnet_mtu parameter of the OpenStackDeployment custom resource (1500 bytes by default). If the MTU of the tunnel interface is higher than this value by at least 4 bytes, the cluster is not affected by the issue.

The cluster contains virtual machines whose guest operating system network interfaces have an MTU larger than the value of the global_physnet_mtu parameter minus 50 bytes.
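
The MTU conditions above can be expressed as simple arithmetic. The following sketch assumes that both conditions must hold for a cluster to be affected. The `mtu_issue_45879` function name is hypothetical. The sketch reports "undetermined" for tunnel MTU values that the description does not cover, such as a tunnel MTU that exceeds global_physnet_mtu by only 1 to 3 bytes.

```shell
# Usage: mtu_issue_45879 <tunnel_mtu> <guest_mtu> [global_physnet_mtu]
# global_physnet_mtu defaults to 1500 bytes, as stated above.
# Prints "affected", "not affected", or "undetermined".
mtu_issue_45879() {
  tunnel_mtu="$1"
  guest_mtu="$2"
  physnet_mtu="${3:-1500}"
  if [ "$guest_mtu" -le $((physnet_mtu - 50)) ]; then
    # Guest MTU is not larger than global_physnet_mtu minus 50 bytes.
    echo "not affected"
  elif [ "$tunnel_mtu" -ge $((physnet_mtu + 4)) ]; then
    # Tunnel MTU is higher than global_physnet_mtu by at least 4 bytes.
    echo "not affected"
  elif [ "$tunnel_mtu" -eq "$physnet_mtu" ]; then
    echo "affected"
  else
    # Case not covered by the issue description.
    echo "undetermined"
  fi
}
```

For example, with the default global_physnet_mtu of 1500 bytes, `mtu_issue_45879 1500 1500` prints "affected", while `mtu_issue_45879 1504 1500` and `mtu_issue_45879 1500 1450` both print "not affected".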

To work around the issue, pin the MOSK Networking service (OpenStack Neutron) container image by adding the following content to the OpenStackDeployment custom resource:

spec:
  services:
    networking:
      neutron:
        values:
          images:
            tags:
              neutron_openvswitch_agent: mirantis.azurecr.io/openstack/neutron:antelope-jammy-20240816113600

Caution

Remove the pinning after updating to MOSK 24.2.1 or a later patch or major release.

Tungsten Fabric¶

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶

Rebooting all nodes of the Cassandra TFConfig or TFAnalytics cluster, performing maintenance, or any other circumstance that causes the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM>

[40032] tf-rabbitmq fails to start after rolling reboot¶

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.

To work around the problem, restart the affected pods manually.

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶

After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

An extra node appears in the cluster in the Down state with an outdated IP address (the IP address of the terminated Cassandra pod).

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
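
To obtain the <HOST-ID> value for the command above, the Down nodes can be extracted from the nodetool status output. The following is a minimal sketch: the `list_down_cassandra_nodes` function name is hypothetical, and the filter assumes the column layout shown in the example response, where Host ID is the second to last column.

```shell
# Reads `nodetool status` output on stdin and prints
# "<address> <host-id>" for every node in the DN (Down/Normal) state.
list_down_cassandra_nodes() {
  awk '$1 == "DN" { print $2, $(NF - 1) }'
}
```

Possible usage: pipe the output of the nodetool status command shown earlier into `list_down_cassandra_nodes`, omitting the `-it` options because no interactive terminal is needed. Before running nodetool removenode, confirm that each printed address is indeed the outdated one.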

Ceph¶

[42903] Inconsistent handling of missing pools by ceph-controller¶

Fixed in MOSK 24.2

In rare cases, when ceph-controller cannot confirm the existence of MOSK pools, it proceeds to recreate the Cinder Ceph client instead of denying the action and raising an error. Such behavior may cause issues with OpenStack workloads.

Workaround:

In spec.cephClusterSpec of the KaaSCephCluster custom resource, remove the external section.
Wait for the Not all mgrs are running: 1/2 message to disappear from the KaaSCephCluster status.

Verify that the nova Ceph client integrated to MOSK has the same keyring as in the Ceph cluster.

Verify that the cinder Ceph client integrated to MOSK has the same keyring as in the Ceph cluster.

Verify that the glance Ceph client integrated to MOSK has the same keyring as in the Ceph cluster.

StackLight¶

[42463] KubePodsCrashLooping is firing during cluster update¶

During a major or patch update of a MOSK cluster with StackLight enabled in non-HA mode, the KubePodsCrashLooping alert may fire for the Grafana ReplicaSet.

Grafana relies on PostgreSQL for persistent data. In a non-HA StackLight setup, PostgreSQL becomes temporarily unavailable during updates. If Grafana loses its database connection or fails to establish one during startup, it fails with an error, which may cause the Grafana pod to enter the CrashLoopBackOff state. Such behavior is expected in non-HA StackLight setups. The Grafana pod resumes normal operation after PostgreSQL is restored.

To prevent the issue, deploy StackLight in HA mode.