Tungsten Fabric known issues¶
This section lists the Tungsten Fabric known issues with workarounds for the Mirantis OpenStack for Kubernetes release 24.2.
Note
For the Tungsten Fabric limitations, refer to Tungsten Fabric known limitations.
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all nodes of the Cassandra TFConfig or TFAnalytics cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
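The verification and workaround below operate on individual Cassandra pods. If you need to look up the exact pod names and replica numbers first, you can list them in the tf namespace; this is a minimal helper, assuming the tf-cassandra-... pod naming used by the commands below:
kubectl -n tf get pods | grep tf-cassandra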
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
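After the deleted pod is recreated, you can re-run the same nodetool status command to confirm that the node has rejoined the ring with its current IP address (status UN). This is a quick verification sketch reusing the pod naming from above:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica_num> -c cassandra -- nodetool status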
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.
To work around the problem, restart the affected pods manually.
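A minimal sketch of the manual restart, assuming the affected pods follow the tf-rabbitmq naming and that deleting a pod lets its controller recreate it:
kubectl -n tf get pods | grep tf-rabbitmq
kubectl -n tf delete pod <TF-RABBITMQ-POD-NAME>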
[40900] Cassandra DB infinite table creation/changing state in Tungsten Fabric¶
During initial deployment of a Tungsten Fabric cluster, there is a known issue where the Cassandra database may enter an infinite table creation or changing state. As a result, the Tungsten Fabric configuration pods fail to reach the Ready state.
The root cause of this issue is a schema mismatch within Cassandra.
To verify whether the cluster is affected, inspect the Tungsten Fabric configuration pods:
kubectl describe pod <TF_CONFIG_POD_NAME> -n tf
The following events might be observed:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 35m (x64 over 78m) kubelet Readiness probe failed: contrail-svc-monitor: initializing (Database:Cassandra[] connection down)
Warning BackOff 30m (x128 over 63m) kubelet Back-off restarting failed container svc-monitor in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
Warning BackOff 25m (x42 over 44m) kubelet Back-off restarting failed container api in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
Normal Started 20m (x18 over 80m) kubelet Started container svc-monitor
Warning Unhealthy 14m (x90 over 70m) kubelet (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = container not running (38fc8f52b45a0918363e5617c0d0181d72a01435a6c1ec4021301e2a0e75805e)
The events above indicate that the configuration services remain in the initializing state after deployment because they cannot connect to the database. As a result, liveness and readiness probes fail, and the pods continuously restart.
Additionally, each node of the Cassandra configuration database logs errors similar to the following:
INFO [MigrationStage:1] 2024-05-29 23:13:23,089 MigrationCoordinator.java:531 - Sending schema pull request to /192.168.159.144 at 1717024403089 with timeout 10000
ERROR [InternalResponseStage:123] 2024-05-29 23:13:24,792 MigrationCoordinator.java:491 - Unable to merge schema from /192.168.159.144
org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found a1de4100-1e02-11ef-8797-b7bb8876c3bc; expected a22ef910-1e02-11ef-a135-a747d50b5bce)
at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:1000)
at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:953)
at org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1495)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1451)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1413)
at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1390)
at org.apache.cassandra.service.MigrationCoordinator.mergeSchemaFrom(MigrationCoordinator.java:449)
at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:487)
at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:475)
at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
at java.lang.Thread.run(Thread.java:750)
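To confirm the schema mismatch directly, you can also compare the schema versions reported by the Cassandra nodes using nodetool describecluster; more than one schema version in the output indicates that the nodes disagree on the schema. This is an optional verification sketch, not part of the official procedure:
kubectl -n tf exec -it tf-cassandra-config-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool describecluster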
Workaround:
To resolve this issue temporarily, restart the affected Cassandra pod:
kubectl delete pod <TF_CASSANDRA_CONFIG_POD_NAME> -n tf
After the pod is restarted, monitor the status of the other Tungsten Fabric pods. If they become Ready within two minutes, the issue is resolved. Otherwise, inspect the latest Cassandra logs in other pods and restart any other pods exhibiting the same pattern of errors:
kubectl delete pod <ANOTHER_TF_CASSANDRA_CONFIG_POD_NAME> -n tf
Repeat the process until all affected pods become Ready.
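For example, you can watch the pod status and pull the most recent Cassandra logs while repeating the procedure; this is a minimal monitoring sketch using standard kubectl commands:
kubectl -n tf get pods -w
kubectl -n tf logs <TF_CASSANDRA_CONFIG_POD_NAME> -c cassandra --tail=100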
[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶
After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.
To verify whether your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status
Example of the system response with outdated IP addresses:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.201.144 509.43 KiB 256 ? 7e760a99-fae5-4921-b0c5-d9e6e1eca1c5 rack1
UN 192.168.50.146 534.18 KiB 256 ? 2248ea35-85d4-4887-820b-1fac4733021f rack1
UN 192.168.145.147 484.19 KiB 256 ? d988aaaa-44ae-4fec-a617-0b0a253e736d rack1
DN 192.168.145.144 481.53 KiB 256 ? c23703a1-6854-47a7-a4a2-af649d63af0c rack1
An extra node appears in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.
To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
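A sketch of the pod deletion step mentioned in the workaround, reusing the pod naming from this section, is shown below. Take the <HOST-ID> value for the removenode command from the Host ID column of the node reported as DN in the nodetool status output above:
kubectl -n tf delete pod tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM>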