Known issues¶
This section lists MOSK known issues with workarounds for the MOSK release 24.2.4:
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
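The following example commands illustrate these steps; the namespace and pod names are placeholders that depend on the deployment:
# Copy the data directory of the affected replica to the local machine as a backup
kubectl cp <namespace>/<mariadb-server pod>:/var/lib/mysql ./mysql-backup
# Verify that the other replicas are up and ready
kubectl -n <namespace> get pods | grep mariadb-server
# Remove the galera.cache file of the affected replica
kubectl -n <namespace> exec <mariadb-server pod> -- rm /var/lib/mysql/galera.cache
# Delete the affected Pod so that Kubernetes recreates it and re-clones the database
kubectl -n <namespace> delete pod <mariadb-server pod>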
[42386] A load balancer service does not obtain the external IP address¶
Due to a MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that has been changed remains with no external IP address assigned. However, the second service, which was changed later, has the external IP assigned as expected.
To work around the issue, make a dummy change to the service object whose external IP is <pending>:
Identify the service that is stuck:
kubectl get svc -A | grep pending
Example of system response:
stacklight iam-proxy-prometheus LoadBalancer 10.233.28.196 <pending> 443:30430/TCP
Add an arbitrary label to the service that is stuck. For example:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
Example of system response:
service/iam-proxy-prometheus labeled
Verify that the external IP was allocated to the service:
kubectl get svc -n stacklight iam-proxy-prometheus
Example of system response:
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
[43058] [Antelope] Cronjob for MariaDB is not created¶
Sometimes, after changing the OpenStackDeployment custom resource, it does not transition to the APPLYING state as expected.
To work around the issue, restart the openstack-controller pod in the osh-system namespace.
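For example, the controller can be restarted by deleting its pod so that Kubernetes recreates it; the exact pod name varies per deployment:
# Find the openstack-controller pod and delete it so that Kubernetes recreates it
kubectl -n osh-system get pods | grep openstack-controller
kubectl -n osh-system delete pod <openstack-controller pod>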
Tungsten Fabric¶
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all Cassandra cluster nodes of TFConfig or TFAnalytics, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
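Once the deleted pod is recreated, rerun the nodetool status command shown above and verify that the nodes report the UN (Up/Normal) status with their actual IP addresses:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica_num> -c cassandra -- nodetool status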
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature during the initialization process.
To work around the problem, restart the affected pods manually.
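For example, assuming the default tf namespace, locate and delete the affected pods so that Kubernetes recreates them; the pod name is a placeholder:
kubectl -n tf get pods | grep tf-rabbitmq
kubectl -n tf delete pod <tf-rabbitmq pod>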
Update known issues¶
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
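As with the previous issue, the RabbitMQ pods can be restarted by deleting them so that Kubernetes recreates them. For example, assuming the usual tf-rabbitmq pod naming, which may differ in your deployment:
kubectl -n tf delete pod tf-rabbitmq-0 tf-rabbitmq-1 tf-rabbitmq-2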
[46671] Cluster update fails with the tf-config pods crashed¶
When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:
tf-config-cs8zr 2/5 CrashLoopBackOff 676 (19s ago) 15h
tf-config-db-6zxgg 1/1 Running 44 (25m ago) 15h
tf-config-db-7k5sz 1/1 Running 43 (23m ago) 15h
tf-config-db-dlwdv 1/1 Running 43 (25m ago) 15h
tf-config-nw4tr 3/5 CrashLoopBackOff 665 (43s ago) 15h
tf-config-wzf6c 1/5 CrashLoopBackOff 680 (10s ago) 15h
tf-control-c6bnn 3/4 Running 41 (23m ago) 13h
tf-control-gsnnp 3/4 Running 42 (23m ago) 13h
tf-control-sj6fd 3/4 Running 41 (23m ago) 13h
To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:
Logs from the tf-config API container:
NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})
Logs from the tf-cassandra pods:
INFO [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
WARN [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready
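These logs can be collected, for example, as follows; the tf-config pod name comes from the pod listing above, and the API container name is assumed to be api:
kubectl -n tf logs <tf-config pod> -c api
kubectl -n tf logs tf-cassandra-config-dc1-rack1-0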
To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods sequentially to establish the connection between them:
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2
After that, all other services in the Tungsten Fabric namespace should be in the Active state.
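To confirm that the tf-config pods have recovered, check their status, for example:
kubectl -n tf get pods | grep tf-config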
[47602] Failed designate-zone-setup job blocks cluster update¶
The designate-zone-setup Kubernetes job in the openstack namespace fails during update to MOSK 24.3 with the following error present in the logs of the job pod:
openstack.exceptions.BadRequestException: BadRequestException: 400:
Client Error for url: http://designate-api.openstack.svc.cluster.local:9001/v2/zones,
Invalid TLD
The issue occurs when the DNS service (OpenStack Designate) has TLDs created, but test is not among them. The DNS service monitoring introduced in MOSK 24.3 attempts to create a test zone test-zone.test in the Designate service, which fails if the test TLD is missing.
To work around the issue, first verify which TLDs are present in the DNS service:
openstack tld list -f value -c name
If there are TLDs present and test is not one of them, create it:
Warning
Do not create the test TLD if no TLDs were present in the DNS service initially. In this case, the issue is caused by a different factor, and creating the test TLD when none existed before may disrupt users of both the DNS and Networking services.
openstack tld create --name test
Example output:
+-------------+--------------------------------------+
| Field | Value |
+-------------+--------------------------------------+
| created_at | 2024-10-22T19:22:15.000000 |
| description | None |
| id | 930fed8b-1e91-4c8c-a00f-7abf68b944d0 |
| name | test |
| updated_at | None |
+-------------+--------------------------------------+