Update known issues

This section lists the known update issues, along with workarounds, for the MOSK release 24.2.

[42449] Rolling reboot failure on a Tungsten Fabric cluster

During a cluster update, the rolling reboot may fail on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric namespace.
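
For example, assuming the RabbitMQ pods run in the tf namespace used elsewhere in this section and that their names contain rabbitmq (verify the actual pod names in your deployment), a minimal sketch of the restart:

kubectl -n tf get pods | grep rabbitmq
kubectl -n tf delete pod <tf-rabbitmq-pod-name>

Kubernetes recreates the deleted pods automatically. Repeat the delete command for each RabbitMQ pod reported by the first command.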

[46671] Cluster update fails with the tf-config pods crashed

When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:

tf-config-cs8zr                            2/5     CrashLoopBackOff   676 (19s ago)   15h
tf-config-db-6zxgg                         1/1     Running            44 (25m ago)    15h
tf-config-db-7k5sz                         1/1     Running            43 (23m ago)    15h
tf-config-db-dlwdv                         1/1     Running            43 (25m ago)    15h
tf-config-nw4tr                            3/5     CrashLoopBackOff   665 (43s ago)   15h
tf-config-wzf6c                            1/5     CrashLoopBackOff   680 (10s ago)   15h
tf-control-c6bnn                           3/4     Running            41 (23m ago)    13h
tf-control-gsnnp                           3/4     Running            42 (23m ago)    13h
tf-control-sj6fd                           3/4     Running            41 (23m ago)    13h
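
Output like the one above can be obtained by listing the pods in the Tungsten Fabric namespace (the header row is omitted in the example above):

kubectl -n tf get pods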

To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:

  • Logs from the tf-config API container:

    NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})
    
  • Logs from the tf-cassandra pods:

    INFO  [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
    WARN  [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready
    

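You can retrieve these logs with commands similar to the following. The api container name inside the tf-config pods is an assumption; list the pod containers first if your deployment differs:

kubectl -n tf get pod tf-config-cs8zr -o jsonpath='{.spec.containers[*].name}'
kubectl -n tf logs tf-config-cs8zr -c api
kubectl -n tf logs tf-cassandra-config-dc1-rack1-0
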
To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods one at a time, so that the Cassandra nodes can re-establish connections with each other:

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2
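
To keep the restart sequential, you can wait for each pod to become Ready before deleting the next one, for example:

kubectl -n tf wait --for=condition=Ready pod/tf-cassandra-config-dc1-rack1-0 --timeout=300s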

After the Cassandra pods restart and re-establish peering, all other services in the Tungsten Fabric namespace should return to the Active state.
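
For example, recheck the pod status to confirm that the tf-config and tf-control pods are no longer crash-looping:

kubectl -n tf get pods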