Tungsten Fabric known issues

This section lists the Tungsten Fabric known issues with workarounds for the Mirantis OpenStack for Kubernetes release 24.2.

Note

For the Tungsten Fabric limitations, refer to Tungsten Fabric known limitations.

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster, maintenance, or any other circumstance that causes the Cassandra pods to start simultaneously may break the TFConfig and/or TFAnalytics Cassandra cluster. In this case, the Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
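For example, assuming replica 1 of the config cluster is affected, the command may look as follows:

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1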

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.

To work around the problem, restart the affected pods manually.
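For example, assuming the tf-rabbitmq pods run in the tf namespace and the affected instance is tf-rabbitmq-0 (verify the actual pod name in your cluster), delete the pod so that it is recreated:

kubectl -n tf delete pod tf-rabbitmq-0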

[40900] Cassandra DB infinite table creation/changing state in Tungsten Fabric

Fixed in 24.2.1 and 24.3

During the initial deployment of a Tungsten Fabric cluster, the Cassandra database may enter an infinite table creation or changing state. As a result, the Tungsten Fabric configuration pods fail to reach the Ready state.

The root cause of this issue is a schema mismatch within the Cassandra cluster.

To verify whether the cluster is affected:

Inspect the events of the Tungsten Fabric configuration pods:

kubectl describe pod <TF_CONFIG_POD_NAME> -n tf

The following events might be observed:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  35m (x64 over 78m)   kubelet  Readiness probe failed: contrail-svc-monitor: initializing (Database:Cassandra[] connection down)
  Warning  BackOff    30m (x128 over 63m)  kubelet  Back-off restarting failed container svc-monitor in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
  Warning  BackOff    25m (x42 over 44m)   kubelet  Back-off restarting failed container api in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
  Normal   Started    20m (x18 over 80m)   kubelet  Started container svc-monitor
  Warning  Unhealthy  14m (x90 over 70m)   kubelet  (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = container not running (38fc8f52b45a0918363e5617c0d0181d72a01435a6c1ec4021301e2a0e75805e)

The events above indicate that the configuration services remain in the initializing state after deployment because they cannot connect to the database. As a result, the liveness and readiness probes fail, and the pods continuously restart.
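To check which configuration pods are not Ready, you can list the pods in the tf namespace, for example:

kubectl -n tf get pods | grep tf-config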

Additionally, each node of the Cassandra configuration database logs errors similar to the following:

INFO  [MigrationStage:1] 2024-05-29 23:13:23,089 MigrationCoordinator.java:531 - Sending schema pull request to /192.168.159.144 at 1717024403089 with timeout 10000
ERROR [InternalResponseStage:123] 2024-05-29 23:13:24,792 MigrationCoordinator.java:491 - Unable to merge schema from /192.168.159.144
org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found a1de4100-1e02-11ef-8797-b7bb8876c3bc; expected a22ef910-1e02-11ef-a135-a747d50b5bce)
 at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:1000)
 at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:953)
 at org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
 at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1495)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1451)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1413)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1390)
 at org.apache.cassandra.service.MigrationCoordinator.mergeSchemaFrom(MigrationCoordinator.java:449)
 at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:487)
 at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:475)
 at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
 at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
 at java.lang.Thread.run(Thread.java:750)
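To check whether a particular Cassandra configuration pod logs this error, you can search its logs, for example (the pod and container names follow the pattern used in the commands above; adjust the replica number as needed):

kubectl -n tf logs tf-cassandra-config-dc1-rack1-0 -c cassandra | grep "Column family ID mismatch"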

Workaround:

To resolve this issue temporarily, restart the affected Cassandra pod:

kubectl delete pod <TF_CASSANDRA_CONFIG_POD_NAME> -n tf

After the pod is restarted, monitor the status of the other Tungsten Fabric pods. If they become Ready within two minutes, the issue is resolved. Otherwise, inspect the latest Cassandra logs in the other pods (see the example commands after this procedure) and restart any other pods exhibiting the same pattern of errors:

kubectl delete pod <ANOTHER_TF_CASSANDRA_CONFIG_POD_NAME> -n tf

Repeat the process until all affected pods become Ready.
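To monitor the pod status and inspect the latest Cassandra logs between restarts, you can use, for example:

kubectl -n tf get pods -w
kubectl -n tf logs <ANOTHER_TF_CASSANDRA_CONFIG_POD_NAME> -c cassandra --tail=100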

[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node

After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.

To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status

Example of the system response with outdated IP addresses:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns    Host ID                               Rack
UN  192.168.201.144  509.43 KiB  256          ?       7e760a99-fae5-4921-b0c5-d9e6e1eca1c5  rack1
UN  192.168.50.146   534.18 KiB  256          ?       2248ea35-85d4-4887-820b-1fac4733021f  rack1
UN  192.168.145.147  484.19 KiB  256          ?       d988aaaa-44ae-4fec-a617-0b0a253e736d  rack1
DN  192.168.145.144  481.53 KiB  256          ?       c23703a1-6854-47a7-a4a2-af649d63af0c  rack1

In the output above, an extra node with an outdated IP address (the IP of the terminated Cassandra pod) appears in the cluster in the Down (DN) state.

To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and then remove the node with the outdated IP address from the Cassandra cluster using nodetool removenode with its Host ID:

kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
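For example, assuming the config cluster from the output above, where the node in the DN state has the Host ID c23703a1-6854-47a7-a4a2-af649d63af0c, the sequence may look as follows (the replica numbers are illustrative):

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2
kubectl -n tf exec -it tf-cassandra-config-dc1-rack1-0 -c cassandra -- nodetool removenode c23703a1-6854-47a7-a4a2-af649d63af0c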