Update known issues¶
This section lists the known update issues for the MOSK 24.2 release, along with their workarounds.
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
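For example, assuming the default tf namespace and the tf-rabbitmq- pod name prefix (verify the actual names in your deployment first), list the RabbitMQ pods and delete them one by one so that Kubernetes recreates them:
kubectl -n tf get pods | grep rabbitmq
# pod names below are examples; use the names returned by the previous command
kubectl -n tf delete pod tf-rabbitmq-0
kubectl -n tf delete pod tf-rabbitmq-1
kubectl -n tf delete pod tf-rabbitmq-2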
[42463] KubePodsCrashLooping alert fires during cluster update¶
During a major or patch update of a MOSK cluster with StackLight enabled in non-HA mode, the KubePodsCrashLooping alert may fire for the Grafana ReplicaSet.
Grafana relies on PostgreSQL for persistent data. In a non-HA StackLight setup, PostgreSQL becomes temporarily unavailable during updates. If Grafana loses its database connection, or fails to establish one during startup, it exits with an error, which may cause the Grafana pod to enter the CrashLoopBackOff state. This behavior is expected in non-HA StackLight setups; the Grafana pod resumes normal operation after PostgreSQL is restored.
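If the alert fires during an update of a non-HA setup, you can verify that the Grafana pod returns to the Running state once PostgreSQL becomes available again, assuming the default stacklight namespace:
kubectl -n stacklight get pods | grep grafana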
To prevent the issue, deploy StackLight in HA mode.
[46671] Cluster update fails with tf-config pods crashing¶
When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:
tf-config-cs8zr 2/5 CrashLoopBackOff 676 (19s ago) 15h
tf-config-db-6zxgg 1/1 Running 44 (25m ago) 15h
tf-config-db-7k5sz 1/1 Running 43 (23m ago) 15h
tf-config-db-dlwdv 1/1 Running 43 (25m ago) 15h
tf-config-nw4tr 3/5 CrashLoopBackOff 665 (43s ago) 15h
tf-config-wzf6c 1/5 CrashLoopBackOff 680 (10s ago) 15h
tf-control-c6bnn 3/4 Running 41 (23m ago) 13h
tf-control-gsnnp 3/4 Running 42 (23m ago) 13h
tf-control-sj6fd 3/4 Running 41 (23m ago) 13h
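The listing above can be obtained by checking the pods in the Tungsten Fabric namespace:
kubectl -n tf get pods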
To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:
Logs from the tf-config API container:

NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})
Logs from the tf-cassandra pods:

INFO [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
WARN [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready
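To collect such logs, use kubectl logs against the affected pods. The pod names below are taken from the example output above, and the api container name inside the tf-config pod is an assumption; list the containers first if it differs in your deployment:
kubectl -n tf get pod tf-config-cs8zr -o jsonpath='{.spec.containers[*].name}'
kubectl -n tf logs tf-config-cs8zr -c api
kubectl -n tf logs tf-cassandra-config-dc1-rack1-0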
To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods one by one, so that they can re-establish connections with each other:
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2
Now, all other services in the Tungsten Fabric namespace should be in the Active state.
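To confirm that the Cassandra nodes have peered again after the restart, you can check the ring state from inside one of the Cassandra pods. The availability of nodetool in the Cassandra container is an assumption based on standard Cassandra images; every node should report the UN (Up/Normal) status:
kubectl -n tf exec tf-cassandra-config-dc1-rack1-0 -- nodetool status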