Tungsten Fabric known issues¶
This section lists the Tungsten Fabric known issues with workarounds for the Mirantis OpenStack for Kubernetes release 24.2.
Note
For the Tungsten Fabric limitations, refer to Tungsten Fabric known limitations.
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
A simultaneous start of all Cassandra pods, for example, after rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster during maintenance, may break the corresponding cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
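To check every replica at once, the verification above can be sketched as a loop that prints only the nodes reported Down (DN). The cluster name (config) and the replica count (3) are assumptions; adjust them to your deployment:

```shell
# List nodes that nodetool reports as Down ("DN") on each replica of the
# config Cassandra cluster. A non-empty result indicates outdated IPs.
for replica in 0 1 2; do
  echo "--- tf-cassandra-config-dc1-rack1-${replica} ---"
  kubectl -n tf exec tf-cassandra-config-dc1-rack1-"${replica}" -c cassandra -- \
    nodetool status | awk '$1 == "DN" { print $2 }'
done
```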
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.
To work around the problem, restart the affected pods manually.
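As a sketch of the manual restart, the affected pods can be found by scanning the RabbitMQ logs for the tracking_records_in_ets failure and deleting only the matching pods. The pod name prefix and the exact log wording are assumptions; verify them against your cluster first:

```shell
# Restart only the tf-rabbitmq pods whose logs mention the
# tracking_records_in_ets failure.
for pod in $(kubectl -n tf get pods -o name | grep tf-rabbitmq); do
  if kubectl -n tf logs "${pod}" | grep -q 'tracking_records_in_ets'; then
    echo "restarting ${pod}"
    kubectl -n tf delete "${pod}"
  fi
done
```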
[40900] Cassandra DB infinite table creation/changing state in Tungsten Fabric¶
During initial deployment of a Tungsten Fabric cluster, the Cassandra database may enter an infinite table creation or changing state. As a result, the Tungsten Fabric configuration pods fail to reach the Ready state.
The root cause of this issue is a schema mismatch within Cassandra.
To verify whether the cluster is affected, inspect the Tungsten Fabric configuration pods:
kubectl describe pod <TF_CONFIG_POD_NAME> -n tf
The following events might be observed:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 35m (x64 over 78m) kubelet Readiness probe failed: contrail-svc-monitor: initializing (Database:Cassandra[] connection down)
Warning BackOff 30m (x128 over 63m) kubelet Back-off restarting failed container svc-monitor in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
Warning BackOff 25m (x42 over 44m) kubelet Back-off restarting failed container api in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
Normal Started 20m (x18 over 80m) kubelet Started container svc-monitor
Warning Unhealthy 14m (x90 over 70m) kubelet (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = container not running (38fc8f52b45a0918363e5617c0d0181d72a01435a6c1ec4021301e2a0e75805e)
The events above indicate that the configuration services remain in the initializing state after deployment because they cannot connect to the database. As a result, the liveness and readiness probes fail, and the pods continuously restart.
Additionally, each node of the Cassandra configuration database logs errors similar to the following:
INFO [MigrationStage:1] 2024-05-29 23:13:23,089 MigrationCoordinator.java:531 - Sending schema pull request to /192.168.159.144 at 1717024403089 with timeout 10000
ERROR [InternalResponseStage:123] 2024-05-29 23:13:24,792 MigrationCoordinator.java:491 - Unable to merge schema from /192.168.159.144
org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found a1de4100-1e02-11ef-8797-b7bb8876c3bc; expected a22ef910-1e02-11ef-a135-a747d50b5bce)
at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:1000)
at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:953)
at org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1495)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1451)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1413)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1390)
at org.apache.cassandra.service.MigrationCoordinator.mergeSchemaFrom(MigrationCoordinator.java:449)
at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:487)
at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:475)
at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
at java.lang.Thread.run(Thread.java:750)
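To see which replicas are affected before restarting anything, the log check above can be sketched as a loop that counts the schema-mismatch errors per pod. The replica count (3) is an assumption; adjust it to your deployment:

```shell
# Report how many "Column family ID mismatch" errors each Cassandra
# config replica has logged; a non-zero count marks a restart candidate.
for replica in 0 1 2; do
  pod="tf-cassandra-config-dc1-rack1-${replica}"
  count=$(kubectl -n tf logs "${pod}" -c cassandra | grep -c 'Column family ID mismatch')
  echo "${pod}: ${count} mismatch error(s)"
done
```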
Workaround:
To resolve this issue temporarily, restart the affected Cassandra pod:
kubectl delete pod <TF_CASSANDRA_CONFIG_POD_NAME> -n tf
After the pod is restarted, monitor the status of other Tungsten Fabric pods.
If they become Ready within two minutes, the issue is resolved. Otherwise, inspect the latest Cassandra logs in other pods and restart any other pods exhibiting the same pattern of errors:
kubectl delete pod <ANOTHER_TF_CASSANDRA_CONFIG_POD_NAME> -n tf
Repeat the process until all affected pods become Ready.
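The two-minute check can be sketched as a polling loop that waits until every configuration pod reports Ready before the next restart round. The app=tf-config label selector is an assumption; confirm the actual labels with kubectl -n tf get pods --show-labels:

```shell
# Poll for up to ~two minutes (12 x 10 s) until all tf-config pods
# report the Ready condition as True.
for i in $(seq 1 12); do
  not_ready=$(kubectl -n tf get pods -l app=tf-config \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c -v True)
  [ "${not_ready}" -eq 0 ] && echo "all tf-config pods Ready" && break
  sleep 10
done
```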