Update known issues¶
This section lists the known update issues for the MOSK 24.2 release, along with their workarounds.
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
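For example, assuming the default tf namespace and the tf-rabbitmq- pod name prefix (verify the actual names in your deployment first), list the RabbitMQ pods and delete them one by one so that Kubernetes recreates them:
kubectl -n tf get pods | grep rabbitmq
# pod names below are examples; use the names returned by the previous command
kubectl -n tf delete pod tf-rabbitmq-0
kubectl -n tf delete pod tf-rabbitmq-1
kubectl -n tf delete pod tf-rabbitmq-2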
[42463] KubePodsCrashLooping alert fires during cluster update¶
During a major or patch update of a MOSK cluster with StackLight enabled in non-HA mode, the KubePodsCrashLooping alert may fire for the Grafana ReplicaSet.
Grafana relies on PostgreSQL for persistent data. In a non-HA StackLight setup, PostgreSQL becomes temporarily unavailable during updates. If Grafana loses its database connection, or fails to establish one during startup, it exits with an error, which may cause the Grafana pod to enter the CrashLoopBackOff state. This behavior is expected in non-HA StackLight setups; the Grafana pod resumes normal operation after PostgreSQL is restored.
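If the alert fires during an update of a non-HA setup, you can verify that the Grafana pod returns to the Running state once PostgreSQL becomes available again, assuming the default stacklight namespace:
kubectl -n stacklight get pods | grep grafana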
To prevent the issue, deploy StackLight in HA mode.
[46671] Cluster update fails with tf-config pods crashing¶
When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:
tf-config-cs8zr 2/5 CrashLoopBackOff 676 (19s ago) 15h
tf-config-db-6zxgg 1/1 Running 44 (25m ago) 15h
tf-config-db-7k5sz 1/1 Running 43 (23m ago) 15h
tf-config-db-dlwdv 1/1 Running 43 (25m ago) 15h
tf-config-nw4tr 3/5 CrashLoopBackOff 665 (43s ago) 15h
tf-config-wzf6c 1/5 CrashLoopBackOff 680 (10s ago) 15h
tf-control-c6bnn 3/4 Running 41 (23m ago) 13h
tf-control-gsnnp 3/4 Running 42 (23m ago) 13h
tf-control-sj6fd 3/4 Running 41 (23m ago) 13h
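The listing above can be obtained by checking the pods in the Tungsten Fabric namespace:
kubectl -n tf get pods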
To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:
Logs from the tf-config API container:

NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})
Logs from the tf-cassandra pods:

INFO [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
WARN [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready
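To collect such logs, use kubectl logs against the affected pods. The pod names below are taken from the example output above, and the api container name inside the tf-config pod is an assumption; list the containers first if it differs in your deployment:
kubectl -n tf get pod tf-config-cs8zr -o jsonpath='{.spec.containers[*].name}'
kubectl -n tf logs tf-config-cs8zr -c api
kubectl -n tf logs tf-cassandra-config-dc1-rack1-0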
To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods one by one, so that they can re-establish connections with each other:
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2
Now, all other services in the Tungsten Fabric namespace should be in the Active state.
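To confirm that the Cassandra nodes have peered again after the restart, you can check the ring state from inside one of the Cassandra pods. The availability of nodetool in the Cassandra container is an assumption based on standard Cassandra images; every node should report the UN (Up/Normal) status:
kubectl -n tf exec tf-cassandra-config-dc1-rack1-0 -- nodetool status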