Known issues¶
This section lists the known issues, with workarounds, for MOSK release 24.1.7.
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
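The workaround steps above can be sketched as a few shell helpers. This is a minimal sketch under stated assumptions, not a supported procedure: the function names, the namespace and Pod arguments, the backup file name, and the location of galera.cache directly under /var/lib/mysql are illustrative and must be verified on the actual cluster. If the Pod runs several containers, kubectl exec may also need the -c option.

```shell
#!/bin/sh
# Hypothetical helpers; the namespace and Pod name are supplied by the operator.

# Step 1: stream a tar archive of /var/lib/mysql from the Pod to a local file.
backup_mysql_dir() {
    ns=$1; pod=$2; out=${3:-"$pod-mysql-backup.tar.gz"}
    kubectl -n "$ns" exec "$pod" -- tar czf - -C /var/lib mysql > "$out"
}

# Step 2: list the mariadb-server Pods so that the operator can confirm
# that the other replicas are up and ready.
list_mariadb_pods() {
    ns=$1
    kubectl -n "$ns" get pods | grep mariadb-server
}

# Step 3: remove the Galera cache file (the path is an assumption).
remove_galera_cache() {
    ns=$1; pod=$2
    kubectl -n "$ns" exec "$pod" -- rm -f /var/lib/mysql/galera.cache
}

# Step 4: delete the Pod so that Kubernetes recreates it.
restart_pod() {
    ns=$1; pod=$2
    kubectl -n "$ns" delete pod "$pod"
}
```

A possible call sequence, with placeholders for the namespace and the Pod name: backup_mysql_dir <NAMESPACE> <POD-NAME>, then list_mariadb_pods <NAMESPACE>, then remove_galera_cache <NAMESPACE> <POD-NAME>, and finally restart_pod <NAMESPACE> <POD-NAME>.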
[36524] etcd enters a panic state after replacement of the controller node¶
After provisioning the controller node, the etcd pod starts before Kubernetes networking is fully operational. As a result, the pod fails to resolve DNS names and to establish connections with other members, and the etcd service eventually enters a panic state.
Workaround:
Delete the PVC related to the replaced controller node:
kubectl -n openstack delete pvc <PVC-NAME>
Delete pods related to the crashing etcd service on the replaced controller node:
kubectl -n openstack delete pods <ETCD-POD-NAME>
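To find the value for <ETCD-POD-NAME>, the pod list can be filtered for etcd pods that are not in the Running state. The following is a minimal sketch: the function name is hypothetical, and matching pod names on the string etcd is an assumption that this document does not confirm.

```shell
#!/bin/sh
# Print the names of etcd pods that are not in the Running state.
# Input on stdin: `kubectl get pods --no-headers` output, where the
# columns are NAME READY STATUS RESTARTS AGE.
# Matching pod names on "etcd" is an assumption for illustration.
crashing_etcd_pods() {
    awk '$1 ~ /etcd/ && $3 != "Running" { print $1 }'
}

# Usage (not executed here):
#   kubectl -n openstack get pods --no-headers | crashing_etcd_pods
```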
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the service that was changed first remains with no external IP address assigned. However, the second service, which was changed later, has the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where external IP is <pending>:
Identify the service that is stuck:
kubectl get svc -A | grep pending
Example of system response:
stacklight iam-proxy-prometheus LoadBalancer 10.233.28.196 <pending> 443:30430/TCP
Add an arbitrary label to the service that is stuck. For example:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
Example of system response:
service/iam-proxy-prometheus labeled
Verify that the external IP was allocated to the service:
kubectl get svc -n stacklight iam-proxy-prometheus
Example of system response:
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
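When several services are stuck at once, the identify-and-label steps above can be combined. The following is a minimal sketch with hypothetical function names. It assumes the default column order of kubectl get svc -A (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE) and reuses the reconcile=1 label from the example above.

```shell
#!/bin/sh
# Print "<namespace> <name>" for every LoadBalancer service whose
# EXTERNAL-IP column is <pending>.
# Input on stdin: `kubectl get svc -A --no-headers` output.
pending_lb_services() {
    awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1, $2 }'
}

# Add a dummy label to each stuck service to trigger reconciliation.
# Any label works; "reconcile=1" follows the example in the text.
relabel_pending_services() {
    kubectl get svc -A --no-headers | pending_lb_services |
    while read -r ns name; do
        kubectl label svc -n "$ns" "$name" reconcile=1 --overwrite
    done
}
```

If a service already carries reconcile=1, labeling it again with the same value changes nothing on the object, so a different value would be needed to produce a change.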
[43058] [Antelope] Cronjob for MariaDB is not created¶
Sometimes, after changing the OpenStackDeployment custom resource, it does not transition to the APPLYING state as expected.
To work around the issue, restart the openstack-controller pod in the osh-system namespace.
[44813] [Antelope] Traffic disruption observed on trunk ports¶
After upgrading to OpenStack Antelope, clusters with configured trunk ports experience traffic flow disruptions that block the cluster updates.
To work around the issue, pin the MOSK Networking service (OpenStack Neutron) container image by adding the following content to the OpenStackDeployment custom resource:
spec:
  services:
    networking:
      neutron:
        values:
          images:
            tags:
              neutron_openvswitch_agent: mirantis.azurecr.io/openstack/neutron:antelope-jammy-20240816113600
Caution
Remove the pinning after updating to MOSK 24.2.1 or later patch or major release.
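As an alternative to editing the custom resource by hand, the same pinning could be applied with kubectl patch. The sketch below is illustrative only: the function names are hypothetical, and the osdpl short name, the openstack namespace, and the object name passed as the first argument are assumptions to verify against the cluster.

```shell
#!/bin/sh
# Build a JSON merge patch that pins the neutron_openvswitch_agent image.
# The key path mirrors the OpenStackDeployment snippet in the text.
neutron_pin_patch() {
    image=$1
    printf '{"spec":{"services":{"networking":{"neutron":{"values":{"images":{"tags":{"neutron_openvswitch_agent":"%s"}}}}}}}}\n' "$image"
}

# Apply the patch to an OpenStackDeployment object.
# Assumptions: the resource short name is "osdpl" and the object
# lives in the "openstack" namespace.
pin_neutron_image() {
    osdpl_name=$1; image=$2
    kubectl -n openstack patch osdpl "$osdpl_name" --type merge \
        -p "$(neutron_pin_patch "$image")"
}

# Usage (not executed here):
#   pin_neutron_image <OSDPL-NAME> \
#     mirantis.azurecr.io/openstack/neutron:antelope-jammy-20240816113600
```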
[45879] [Antelope] Incorrect packet handling between instance and its gateway¶
After upgrading to OpenStack Antelope, virtual machines experience connectivity disruptions when sending data over virtual networks. Network packets with full MTU are dropped.
The issue affects the MOSK clusters with Open vSwitch as the networking backend and with the following specific MTU settings:
The MTU configured on the tunnel interface of compute nodes is equal to the value of the spec:services:networking:neutron:values:conf:neutron:DEFAULT:global_physnet_mtu parameter of the OpenStackDeployment custom resource (if not specified, the default is 1500 bytes). If the MTU of the tunnel interface is higher by at least 4 bytes, the cluster is not affected by the issue.
The cluster contains virtual machines in which the MTU of the guest operating system network interfaces is larger than the value of the global_physnet_mtu parameter above minus 50 bytes.
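The two MTU conditions above can be expressed as a small check. The following is a minimal sketch with a hypothetical function name. It encodes only what the text states: an equal tunnel MTU combined with a guest MTU above global_physnet_mtu minus 50 means affected, and a tunnel MTU higher by at least 4 bytes means not affected. Any other tunnel MTU is reported as undetermined because the text does not cover it.

```shell
#!/bin/sh
# Classify a cluster against the MTU conditions listed in the text.
# Arguments: tunnel interface MTU, guest interface MTU,
#            global_physnet_mtu (optional; 1500 if omitted).
# Prints: affected | not-affected | undetermined
mtu_issue_status() {
    tunnel_mtu=$1; guest_mtu=$2; physnet_mtu=${3:-1500}
    if [ "$tunnel_mtu" -ge $((physnet_mtu + 4)) ]; then
        # Tunnel MTU is higher by at least 4 bytes.
        echo not-affected
    elif [ "$guest_mtu" -le $((physnet_mtu - 50)) ]; then
        # Guest MTU does not exceed global_physnet_mtu minus 50 bytes.
        echo not-affected
    elif [ "$tunnel_mtu" -eq "$physnet_mtu" ]; then
        echo affected
    else
        # Tunnel MTU differs from global_physnet_mtu by less than 4 bytes
        # or is lower; the text does not describe this case.
        echo undetermined
    fi
}
```

For example, mtu_issue_status 1500 1500 prints affected, while mtu_issue_status 1504 1500 and mtu_issue_status 1500 1450 both print not-affected.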
To work around the issue, pin the MOSK Networking service (OpenStack Neutron) container image by adding the following content to the OpenStackDeployment custom resource:
spec:
  services:
    networking:
      neutron:
        values:
          images:
            tags:
              neutron_openvswitch_agent: mirantis.azurecr.io/openstack/neutron:antelope-jammy-20240816113600
Caution
Remove the pinning after updating to MOSK 24.2.1 or later patch or major release.
Tungsten Fabric¶
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets feature flag during the initialization process.
To work around the problem, restart the affected pods manually.
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all nodes of the Cassandra TFConfig or TFAnalytics cluster, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may break the Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot use the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
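The output above can also be checked programmatically by looking for rows in the DN (Down/Normal) state. The following is a minimal sketch with a hypothetical function name. It assumes the column layout shown in the example, where the Host ID is the second-to-last field.

```shell
#!/bin/sh
# Print the Host ID of every node that nodetool reports as DN.
# Input on stdin: `nodetool status` output.
# The Load column may be "?" or a value with a unit ("3.84 GiB"),
# so the Host ID is taken by counting fields from the end of the row.
down_cassandra_nodes() {
    awk '$1 == "DN" { print $(NF - 1) }'
}

# Usage (not executed here; the replica number 0 is only an example):
#   kubectl -n tf exec tf-cassandra-config-dc1-rack1-0 -c cassandra -- \
#     nodetool status | down_cassandra_nodes
```

Empty output means that no node is reported as down. Any printed Host ID indicates that the cluster may be affected.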
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
Ceph¶
[42903] Inconsistent handling of missing pools by ceph-controller¶
In rare cases, when ceph-controller cannot confirm the existence of MOSK pools, instead of denying the action and raising errors, it proceeds to recreate the Cinder Ceph client. Such behavior may cause issues with OpenStack workloads.
Workaround:
In spec.cephClusterSpec of the KaaSCephCluster custom resource, remove the external section.
Wait for the Not all mgrs are running: 1/2 message to disappear from the KaaSCephCluster status.
Verify that the nova Ceph client that is integrated to MOSK has the same keyring as in the Ceph cluster.
Keyring verification for the Ceph nova client
Compare the keyring used in the nova-compute and libvirt pods with the one from the Ceph cluster:
kubectl -n openstack get pod | grep nova-compute
kubectl -n openstack exec -it <nova-compute-pod-name> -- cat /etc/ceph/ceph.client.nova.keyring
kubectl -n openstack get pod | grep libvirt
kubectl -n openstack exec -it <libvirt-pod-name> -- cat /etc/ceph/ceph.client.nova.keyring
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.nova
If the keyring differs, replace the one stored in the Ceph cluster with the key from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
ceph auth get client.nova -o /tmp/nova.key
vi /tmp/nova.key
# in the editor, change "key" value to the key obtained from the OpenStack pods
# then save and exit editing
ceph auth import -i /tmp/nova.key
Verify that the client.nova keyring of the Ceph cluster matches the one obtained from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.nova
Verify that nova-compute and libvirt pods have access to the Ceph cluster:
kubectl -n openstack get pod | grep nova-compute
kubectl -n openstack exec -it <nova-compute-pod-name> -- ceph -s -n client.nova
kubectl -n openstack get pod | grep libvirt
kubectl -n openstack exec -it <libvirt-pod-name> -- ceph -s -n client.nova
Verify that the cinder Ceph client integrated to MOSK has the same keyring as in the Ceph cluster.
Keyring verification for the Ceph cinder client
Compare the keyring used in the cinder-volume pods with the one from the Ceph cluster:
kubectl -n openstack get pod | grep cinder-volume
kubectl -n openstack exec -it <cinder-volume-pod-name> -- cat /etc/ceph/ceph.client.cinder.keyring
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.cinder
If the keyring differs, replace the one stored in the Ceph cluster with the key from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
ceph auth get client.cinder -o /tmp/cinder.key
vi /tmp/cinder.key
# in the editor, change "key" value to the key obtained from the OpenStack pods
# then save and exit editing
ceph auth import -i /tmp/cinder.key
Verify that the client.cinder keyring of the Ceph cluster matches the one obtained from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.cinder
Verify that the cinder-volume pods have access to the Ceph cluster:
kubectl -n openstack get pod | grep cinder-volume
kubectl -n openstack exec -it <cinder-volume-pod-name> -- ceph -s -n client.cinder
Verify that the glance Ceph client integrated to MOSK has the same keyring as in the Ceph cluster.
Keyring verification for the Ceph glance client
Compare the keyring used in the glance-api pods with the one from the Ceph cluster:
kubectl -n openstack get pod | grep glance-api
kubectl -n openstack exec -it <glance-api-pod-name> -- cat /etc/ceph/ceph.client.glance.keyring
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.glance
If the keyring differs, replace the one stored in the Ceph cluster with the key from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
ceph auth get client.glance -o /tmp/glance.key
vi /tmp/glance.key
# in the editor, change "key" value to the key obtained from the OpenStack pods
# then save and exit editing
ceph auth import -i /tmp/glance.key
Verify that the client.glance keyring of the Ceph cluster matches the one obtained from the OpenStack pods:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph auth get client.glance
Verify that the glance-api pods have access to the Ceph cluster:
kubectl -n openstack get pod | grep glance-api
kubectl -n openstack exec -it <glance-api-pod-name> -- ceph -s -n client.glance
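The keyring comparison for the nova, cinder, and glance clients in the workaround above follows the same pattern, so it can be sketched as one helper. This is a minimal sketch with hypothetical function names. It assumes the usual keyring layout with a "key = <value>" line and uses the namespaces, file paths, and rook-ceph-tools deployment shown in the commands above.

```shell
#!/bin/sh
# Extract the value of the "key = ..." line from keyring text on stdin.
keyring_key() {
    awk '$1 == "key" && $2 == "=" { print $3; exit }'
}

# Compare the key in a pod keyring file with the key stored in Ceph.
# Arguments: OpenStack pod name, Ceph client name (nova, cinder, glance).
# Prints "match" or "mismatch"; returns non-zero on mismatch.
compare_ceph_key() {
    pod=$1; client=$2
    pod_key=$(kubectl -n openstack exec "$pod" -- \
        cat "/etc/ceph/ceph.client.$client.keyring" | keyring_key)
    ceph_key=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
        ceph auth get "client.$client" | keyring_key)
    if [ -n "$pod_key" ] && [ "$pod_key" = "$ceph_key" ]; then
        echo match
    else
        echo mismatch
        return 1
    fi
}

# Usage (not executed here):
#   compare_ceph_key <nova-compute-pod-name> nova
#   compare_ceph_key <cinder-volume-pod-name> cinder
#   compare_ceph_key <glance-api-pod-name> glance
```

The helper only reports whether the keys differ. Replacing the key in the Ceph cluster remains a manual step, as described above.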