Cluster update known issues¶
This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 22.1.
[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found
[22725] Live migration may fail for instances with deleted images
[16987] Cluster update fails at Ceph CSI pod eviction
[18871] MySQL crashes during managed cluster update or instances live migration
[21998] OpenStack Controller may get stuck during the managed cluster update
[22321] Neutron back end may change from TF to ML2
[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found¶
A Ceph cluster fails to update on a managed cluster with the following message:
Failed to configure Ceph cluster: ceph cluster verification is failed:
[Daemonset csi-rbdplugin is not found]
As a workaround, restart the rook-ceph-operator pod:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
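After the operator pod is rescheduled, you can optionally verify that the csi-rbdplugin DaemonSet reported in the error message is present again. This check is a suggestion rather than part of the official procedure:

kubectl -n rook-ceph get daemonset csi-rbdplugin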
[22725] Live migration may fail for instances with deleted images¶
During the update of a MOSK cluster to 22.1, live migration may fail for instances if their images were previously deleted. In this case, the nova-compute pod contains an error message similar to the following one:
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Command: scp -C -r kaas-node-03ab613d-cf79-4830-ac70-ed735453481a:/var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8 /var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Exit code: 1
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stdout: ''
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stderr: 'ssh: Could not resolve hostname kaas-node-03ab613d-cf79-4830-ac70-ed735453481a: Name or service not known\r\n'
Workaround:
If you have not yet started the managed cluster update, change the nova-compute image by setting the following structure in the OpenStackDeployment CR:

spec:
  services:
    compute:
      nova:
        values:
          images:
            tags:
              nova_compute: mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700

If you have already started the managed cluster update, manually update the nova-compute container image in the nova-compute DaemonSet to mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700. Example commands for both options are sketched below.
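The following commands are a minimal sketch of both options. The OpenStackDeployment object name (osh-dev) and the nova-compute DaemonSet and container names are assumptions and may differ in your environment:

# Option 1: before the update, add the image tag override to the OpenStackDeployment CR
kubectl -n openstack edit osdpl osh-dev

# Option 2: after the update has started, patch the image directly in the DaemonSet
kubectl -n openstack set image daemonset/nova-compute \
  nova-compute=mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700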
[16987] Cluster update fails at Ceph CSI pod eviction¶
An update of a MOSK cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.
Workaround:
1. Scale the affected StatefulSet of the pod that fails to init down to 0 replicas. If it is a DaemonSet, such as nova-compute, make sure it is not scheduled on the affected node. See the scaling sketch after this procedure.
2. On every csi-rbdplugin pod, search for the stuck csi-vol:
   rbd device list | grep <csi-vol-uuid>
3. Unmap the affected csi-vol:
   rbd unmap -o force /dev/rbd<i>
4. Delete the volumeattachment of the affected pod:
   kubectl get volumeattachments | grep <csi-vol-uuid>
   kubectl delete volumeattachment <id>
5. Scale the affected StatefulSet back to the original number of replicas or until its state is Running. If it is a DaemonSet, run the pod on the affected node again.
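A minimal sketch of the scaling commands for the first and last steps; the namespace and StatefulSet name are placeholders and depend on the affected pod:

kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 0
# perform the unmap and volumeattachment cleanup steps, then scale back
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas <original-number>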
[18871] MySQL crashes during managed cluster update or instances live migration¶
MySQL may crash when performing instances live migration or during a managed cluster update. After the crash, MariaDB cannot connect to the cluster and gets stuck in the CrashLoopBackOff state.
Workaround:
1. Verify that other MariaDB replicas are up and running and have joined the cluster:
   1. Verify that at least 2 pods are running and operational (2/2 and Running):
      kubectl -n openstack get pods | grep maria
      Example of system response where the pods mariadb-server-0 and mariadb-server-2 are operational:
      mariadb-controller-77b5ff47d5-ndj68   1/1   Running   0   39m
      mariadb-server-0                      2/2   Running   0   39m
      mariadb-server-1                      0/2   Running   0   39m
      mariadb-server-2                      2/2   Running   0   39m
   2. Log in to each operational pod and verify that the node is Primary and the cluster size is at least 2. For example:
      mysql -u root -p$MYSQL_DBADMIN_PASSWORD -e "show status;" | grep -e \
        wsrep_cluster_size -e "wsrep_cluster_status" -e "wsrep_local_state_comment"
      Example of system response:
      wsrep_cluster_size             2
      wsrep_cluster_status           Primary
      wsrep_local_state_comment      Synced
2. Remove the content of the /var/lib/mysql/* directory:
   kubectl -n openstack exec -it mariadb-server-1 -- rm -rf /var/lib/mysql/*
3. Restart the MariaDB container:
   kubectl -n openstack delete pod mariadb-server-1
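After the pod is recreated, you can optionally repeat the verification from the first step to confirm that mariadb-server-1 reports 2/2 and Running and that the recovered node has rejoined the cluster. This check is a suggestion, not part of the official procedure:

kubectl -n openstack get pods | grep maria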
[21998] OpenStack Controller may get stuck during the managed cluster update¶
During the MOSK cluster update, the OpenStack Controller may get stuck with the following symptoms:
Multiple nodemaintenancerequests exist:

kubectl get nodemaintenancerequests

NAME                                             AGE
kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s

One nodemaintenancerequest has a DeletedAt time stamp and an active openstack-controller finalizer:

finalizers:
- lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer

In the openstack-controller logs, retries are exhausted:

2022-02-17 18:41:43,317 [ERROR] kopf._core.engines.peering: Request attempt #8 failed; will retry: PATCH https://10.232.0.1:443/apis/zalando.org/v1/namespaces/openstack/kopfpeerings/openstack-controller.nodemaintenancerequest -> APIServerError('Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', 'reason': 'InternalError', 'details': {'causes': [{'message': 'unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool'}]}, 'code': 500})
2022-02-17 18:42:50,834 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:47:50,848 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:52:50,853 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 18:57:50,858 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
2022-02-17 19:02:50,862 [INFO] kopf.objects: Timer 'heartbeat' succeeded.

A notification about a successful finish does not exist:

kopf.objects: Handler 'node_maintenance_request_delete_handler' succeeded.
As a workaround, delete the OpenStack Controller pod:
kubectl -n osh-system delete pod -l app.kubernetes.io/name=openstack-operator
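After the new OpenStack Controller pod starts, you can optionally confirm that the stuck request has been processed. These checks are a suggestion based on the symptoms above, not part of the official procedure:

kubectl get nodemaintenancerequests
kubectl -n osh-system logs -l app.kubernetes.io/name=openstack-operator | grep node_maintenance_request_delete_handler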
[22321] Neutron back end may change from TF to ML2¶
An update of the MOSK cluster with Tungsten Fabric may hang due to the changed Neutron back end with the following symptoms:
The libvirt and nova-compute pods fail to start:

Entrypoint WARNING: 2022/03/03 08:49:45 entrypoint.go:72: Resolving dependency Pod on same host with labels map[application:neutron component:neutron-ovs-agent] in namespace openstack failed: Found no pods matching labels: map[application:neutron component:neutron-ovs-agent] .

In the OSDPL network section, the ml2 back end is specified instead of tungstenfabric:

spec:
  features:
    neutron:
      backend: ml2
As a workaround, change the back end option from ml2 to tungstenfabric:

spec:
  features:
    neutron:
      backend: tungstenfabric
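A minimal sketch of applying this change by editing the OpenStackDeployment object; the osdpl short name and the object name are assumptions, so obtain the actual name with the first command:

kubectl -n openstack get osdpl
kubectl -n openstack edit osdpl <osdpl-name>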