Tungsten Fabric known issues

This section lists the Tungsten Fabric known issues with workarounds for the Mirantis OpenStack for Kubernetes release 24.2.

Note

For the Tungsten Fabric limitations, refer to Tungsten Fabric known limitations.

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the TFConfig and/or TFAnalytics Cassandra cluster. In this case, the Cassandra nodes do not join the ring and do not update the IP addresses of the neighbor nodes. As a result, the TF services cannot operate the affected Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1
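For example, to check all replicas of the config cluster in one pass, you can iterate over the replica ordinals. The replica count of 3 below is an assumption and may differ in your deployment:

# Assumes 3 replicas of the config cluster; adjust the range to your deployment
for replica in 0 1 2; do
  kubectl -n tf exec tf-cassandra-config-dc1-rack1-${replica} -c cassandra -- nodetool status
done

Nodes reported as DN with outdated addresses indicate the issue, while healthy nodes appear as UN with their actual IP addresses.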

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
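After the pod is recreated, you can verify that the node has rejoined the ring. The commands below are a sketch that reuses the pod naming from the examples above:

kubectl -n tf get pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica_num> -c cassandra -- nodetool status

All nodes should eventually report UN with their actual IP addresses.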

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in the tf-rabbitmq pods fail to enable tracking_records_in_ets during initialization.

To work around the problem, restart the affected pods manually.
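For example, assuming the default tf-rabbitmq pod naming, you can locate the affected pods and delete them so that they are recreated automatically; <TF_RABBITMQ_POD_NAME> is a placeholder for the pod name from the first command:

kubectl -n tf get pods | grep tf-rabbitmq
kubectl -n tf delete pod <TF_RABBITMQ_POD_NAME>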

[40900] Cassandra DB infinite table creation/changing state in Tungsten Fabric

Fixed in 24.2.1 and 24.3

During the initial deployment of a Tungsten Fabric cluster, the Cassandra database may enter an infinite table creation or table alteration state. As a result, the Tungsten Fabric configuration pods fail to reach the Ready state.

The root cause of this issue is a schema mismatch within Cassandra.

To verify whether the cluster is affected:

Inspect the events of the Tungsten Fabric configuration pods:

kubectl describe pod <TF_CONFIG_POD_NAME> -n tf

The following events might be observed:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  35m (x64 over 78m)   kubelet  Readiness probe failed: contrail-svc-monitor: initializing (Database:Cassandra[] connection down)
  Warning  BackOff    30m (x128 over 63m)  kubelet  Back-off restarting failed container svc-monitor in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
  Warning  BackOff    25m (x42 over 44m)   kubelet  Back-off restarting failed container api in pod tf-config-mfkfc_tf(fc77e6b6-d7b9-4680-bffc-618796a754af)
  Normal   Started    20m (x18 over 80m)   kubelet  Started container svc-monitor
  Warning  Unhealthy  14m (x90 over 70m)   kubelet  (combined from similar events): Readiness probe errored: rpc error: code = Unknown desc = container not running (38fc8f52b45a0918363e5617c0d0181d72a01435a6c1ec4021301e2a0e75805e)

The events above indicate that the configuration services remain in the initializing state after deployment because they cannot connect to the database. As a result, the liveness and readiness probes fail, and the pods continuously restart.
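A quick way to observe this behavior is to check the restart counters of the configuration pods; the pod name pattern below follows the events above and may differ in your deployment:

kubectl -n tf get pods | grep tf-config

Affected pods show a growing RESTARTS counter and never reach the Ready state.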

Additionally, each node of the Cassandra configuration database logs errors similar to the following:

INFO  [MigrationStage:1] 2024-05-29 23:13:23,089 MigrationCoordinator.java:531 - Sending schema pull request to /192.168.159.144 at 1717024403089 with timeout 10000
ERROR [InternalResponseStage:123] 2024-05-29 23:13:24,792 MigrationCoordinator.java:491 - Unable to merge schema from /192.168.159.144
org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found a1de4100-1e02-11ef-8797-b7bb8876c3bc; expected a22ef910-1e02-11ef-a135-a747d50b5bce)
 at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:1000)
 at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:953)
 at org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
 at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1495)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1451)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1413)
 at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1390)
 at org.apache.cassandra.service.MigrationCoordinator.mergeSchemaFrom(MigrationCoordinator.java:449)
 at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:487)
 at org.apache.cassandra.service.MigrationCoordinator$Callback.response(MigrationCoordinator.java:475)
 at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
 at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
 at java.lang.Thread.run(Thread.java:750)
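To check a particular Cassandra configuration pod for this error, you can grep its recent log. The pod naming below reuses the examples from issue [13755] and is an assumption for your deployment:

kubectl -n tf logs tf-cassandra-config-dc1-rack1-<replica_num> -c cassandra --tail=500 | grep "Column family ID mismatch"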

Workaround:

To resolve this issue temporarily, restart the affected Cassandra pod:

kubectl delete pod <TF_CASSANDRA_CONFIG_POD_NAME> -n tf

After the pod is restarted, monitor the status of other Tungsten Fabric pods. If they become Ready within two minutes, the issue is resolved. Otherwise, inspect the latest Cassandra logs in other pods and restart any other pods exhibiting the same pattern of errors:

kubectl delete pod <ANOTHER_TF_CASSANDRA_CONFIG_POD_NAME> -n tf

Repeat the process until all affected pods become Ready.
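To monitor the recovery, you can watch the pods in the tf namespace until they all report Ready:

kubectl -n tf get pods -w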