Known issues¶
This section lists MOSK known issues with workarounds for the Mirantis OpenStack for Kubernetes release 24.1.2.
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
Due to the upstream MariaDB issue, during MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
Create a backup of the
/var/lib/mysql
directory on themariadb-server
Pod.Verify that other replicas are up and ready.
Remove the
galera.cache
file for the affectedmariadb-server
Pod.Remove the affected
mariadb-server
Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
[36524] etcd enters a panic state after replacement of the controller node¶
After provisioning the controller node, the etcd pod initiates before the Kubernetes networking is fully operational. As a result, the pod encounters difficulties resolving DNS and establishing connections with other members, ultimately leading to a panic state for the etcd service.
Workaround:
Delete the PVC related to the replaced controller node:
kubectl -n openstack delete pvc <PVC-NAME>
Delete pods related to the crashing etcd service on the replaced controller node:
kubectl -n openstack delete pods <ETCD-POD-NAME>
[41810] Cluster update is stuck due to the OpenStack controller flooding¶
The cluster update may stuck if the maximum number of the worker nodes to update simultaneously is ten or higher.
To work around the problem, set the
spec.providerSpec.maxWorkerUpgradeCount
to a value lower than 10
.
For configuration details, see Configure the parallel update of worker nodes.
Tungsten Fabric¶
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq
pods fail to enable
the tracking_records_in_ets
during the initialization process.
To work around the problem, restart the affected pods manually.
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all Cassandra cluster TFConfig or TFAnalytics nodes, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may cause a broken Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config
or
analytics
cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config
or analytics
cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>