Cluster update known issues¶
This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 22.1.
[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found
[22725] Live migration may fail for instances with deleted images
[16987] Cluster update fails at Ceph CSI pod eviction
[18871] MySQL crashes during managed cluster update or instances live migration
[21998] OpenStack Controller may get stuck during the managed cluster update
[22321] Neutron backend may change from TF to ML2
[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found¶
A Ceph cluster fails to update on a managed cluster with the following message:
Failed to configure Ceph cluster: ceph cluster verification is failed:
[Daemonset csi-rbdplugin is not found]
As a workaround, restart the rook-ceph-operator pod:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
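Optionally, once the rook-ceph-operator pod is recreated, you can confirm that the csi-rbdplugin DaemonSet exists again. This check is a suggestion rather than part of the documented workaround:
kubectl -n rook-ceph get daemonset csi-rbdplugin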
[22725] Live migration may fail for instances with deleted images¶
During the update of a MOSK cluster to 22.1, live migration may fail for instances if their images were previously deleted. In this case, the nova-compute pod logs contain an error message similar to the following one:
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Command: scp -C -r kaas-node-03ab613d-cf79-4830-ac70-ed735453481a:/var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8 /var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Exit code: 1
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stdout: ''
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stderr: 'ssh: Could not resolve hostname kaas-node-03ab613d-cf79-4830-ac70-ed735453481a: Name or service not known\r\n'
Workaround:
If you have not yet started the managed cluster update, change the nova-compute image by setting the following metadata in the OpenStackDeployment CR:

spec:
  services:
    compute:
      nova:
        values:
          images:
            tags:
              nova_compute: mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700

If you have already started the managed cluster update, manually update the nova-compute container image in the nova-compute DaemonSet to mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700.
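A minimal sketch of the manual image update, assuming the DaemonSet and its container are both named nova-compute in the openstack namespace; the actual names in your deployment may differ, so verify them with kubectl -n openstack get daemonset before changing anything:
kubectl -n openstack set image daemonset/nova-compute nova-compute=mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700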
[16987] Cluster update fails at Ceph CSI pod eviction¶
An update of a MOSK cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.
Workaround:
1. Scale the affected StatefulSet of the pod that fails to init down to 0 replicas. If the pod belongs to a DaemonSet such as nova-compute, make sure it is not scheduled on the affected node. See the example scaling commands after this procedure.

2. On every csi-rbdplugin pod, search for the stuck csi-vol:

   rbd device list | grep <csi-vol-uuid>

3. Unmap the affected csi-vol:

   rbd unmap -o force /dev/rbd<i>

4. Delete the volumeattachment of the affected pod:

   kubectl get volumeattachments | grep <csi-vol-uuid>
   kubectl delete volumeattachment <id>

5. Scale the affected StatefulSet back to the original number of replicas or until its state is Running. If it is a DaemonSet, run the pod on the affected node again.
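For steps 1 and 5, a minimal sketch of the scaling commands, assuming the affected pod is controlled by a StatefulSet; the namespace, StatefulSet name, and original replica count are placeholders to substitute for your environment:
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 0
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas <original-replicas>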
[18871] MySQL crashes during managed cluster update or instances live migration¶
MySQL may crash when performing instances live migration or during a managed
cluster update. After the crash, MariaDB cannot connect to the cluster and gets
stuck in the CrashLoopBackOff
state.
Workaround:
1. Verify that other MariaDB replicas are up and running and have joined the cluster:

   - Verify that at least 2 pods are running and operational (2/2 and Running):

     kubectl -n openstack get pods | grep maria

     Example of system response where the pods mariadb-server-0 and mariadb-server-2 are operational:

     mariadb-controller-77b5ff47d5-ndj68   1/1   Running   0   39m
     mariadb-server-0                      2/2   Running   0   39m
     mariadb-server-1                      0/2   Running   0   39m
     mariadb-server-2                      2/2   Running   0   39m

   - Log in to each operational pod and verify that the node is Primary and the cluster size is at least 2. For example:

     mysql -u root -p$MYSQL_DBADMIN_PASSWORD -e "show status;" | grep -e \
     wsrep_cluster_size -e "wsrep_cluster_status" -e "wsrep_local_state_comment"

     Example of system response:

     wsrep_cluster_size                2
     wsrep_cluster_status              Primary
     wsrep_local_state_comment         Synced

2. Remove the content of the /var/lib/mysql/* directory on the failed replica (mariadb-server-1 in this example):

   kubectl -n openstack exec -it mariadb-server-1 -- rm -rf /var/lib/mysql/*

3. Restart the MariaDB container:

   kubectl -n openstack delete pod mariadb-server-1
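Optionally, after the pod is recreated, you can confirm that the recovered replica rejoins the cluster by checking that it reaches the 2/2 Running state; the wsrep_cluster_size reported inside the operational pods should grow accordingly. This check is a suggestion rather than part of the documented workaround:
kubectl -n openstack get pods | grep mariadb-server-1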
[21998] OpenStack Controller may get stuck during the managed cluster update¶
During the MOSK cluster update, the OpenStack Controller may get stuck with the following symptoms:
Multiple nodemaintenancerequests exist:

kubectl get nodemaintenancerequests

NAME                                             AGE
kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s
One nodemaintenancerequest has a DeletedAt time stamp and an active openstack-controller finalizer:

finalizers:
- lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer
In the openstack-controller logs, retries are exhausted:

2022-02-17 18:41:43,317 [ERROR] kopf._core.engines.peering: Request attempt #8 failed; will retry: PATCH https://10.232.0.1:443/apis/zalando.org/v1/namespaces/openstack/kopfpeerings/openstack-controller.nodemaintenancerequest -> APIServerError('Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', 'reason': 'InternalError', 'details': {'causes': [{'message': 'unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool'}]}, 'code': 500})
2022-02-17 18:42:50,834 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:47:50,848 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:52:50,853 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:57:50,858 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 19:02:50,862 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
The logs contain no notification about a successful finish, such as:
kopf.objects: Handler 'node_maintenance_request_delete_handler' succeeded.
As a workaround, delete the OpenStack Controller pod:
kubectl -n osh-system delete pod -l app.kubernetes.io/name=openstack-operator
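After the pod is recreated, you can optionally verify that the OpenStack Controller is running again and that the stuck nodemaintenancerequest objects are eventually finalized and removed. This check is a suggestion rather than part of the documented workaround:
kubectl -n osh-system get pod -l app.kubernetes.io/name=openstack-operator
kubectl get nodemaintenancerequests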
[22321] Neutron backend may change from TF to ML2¶
An update of a MOSK cluster with Tungsten Fabric may hang because the Neutron backend changes from tungstenfabric to ml2, with the following symptoms:
The libvirt and nova-compute pods fail to start:

Entrypoint WARNING: 2022/03/03 08:49:45 entrypoint.go:72: Resolving dependency Pod on same host with labels map[application:neutron component:neutron-ovs-agent] in namespace openstack failed: Found no pods matching labels: map[application:neutron component:neutron-ovs-agent] .
In the OSDPL network section, the ml2 backend is specified instead of tungstenfabric:

spec:
  features:
    neutron:
      backend: ml2
As a workaround, change the backend option from ml2 to tungstenfabric:

spec:
  features:
    neutron:
      backend: tungstenfabric
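A minimal sketch of applying this change by editing the OpenStackDeployment (OSDPL) object, assuming it resides in the openstack namespace; list the objects first to find the actual resource name, which differs between deployments:
kubectl -n openstack get osdpl
kubectl -n openstack edit osdpl <osdpl-name>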