Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 22.1.


[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found

A Ceph cluster fails to update on a managed cluster with the following message:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[Daemonset csi-rbdplugin is not found]

As a workaround, restart the rook-ceph-operator pod:

kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
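
After the operator pod is recreated, you can verify that the csi-rbdplugin DaemonSet reported in the error is present again, for example:

kubectl -n rook-ceph get daemonset csi-rbdplugin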

[22725] Live migration may fail for instances with deleted images

Fixed in MOSK 22.2

During the update of a MOSK cluster to 22.1, live migration may fail for instances if their images were previously deleted. In this case, the nova-compute pod logs contain an error message similar to the following one:

2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Command: scp -C -r kaas-node-03ab613d-cf79-4830-ac70-ed735453481a:/var/l
ib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8 /var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Exit code: 1
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stdout: ''
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stderr: 'ssh: Could not resolve hostname kaas-node-03ab613d-cf79-4830-ac
70-ed735453481a: Name or service not known\r\n'

Workaround:

  • If you have not yet started the managed cluster update, change the nova-compute image by setting the following image tag in the OpenStackDeployment CR:

    spec:
      services:
        compute:
          nova:
            values:
              images:
                tags:
                  nova_compute: mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700
    
  • If you have already started the managed cluster update, manually update the nova-compute container image in the nova-compute DaemonSet to mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700.
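
    For example, assuming that both the DaemonSet and its container are named nova-compute and reside in the openstack namespace (the names may differ in your deployment), the image can be updated with a command similar to:

    kubectl -n openstack set image daemonset/nova-compute \
      nova-compute=mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700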


[16987] Cluster update fails at Ceph CSI pod eviction

Fixed in MOSK 22.2

An update of a MOSK cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.

Workaround:

  1. Scale the StatefulSet of the affected pod that fails to initialize down to 0 replicas. If the affected workload is a DaemonSet, such as nova-compute, make sure that it is not scheduled on the affected node. Example scale commands are provided after this procedure.

  2. On every csi-rbdplugin pod, search for the stuck csi-vol device:

    rbd device list | grep <csi-vol-uuid>
    
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    
  4. Delete the VolumeAttachment object of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  5. Scale the affected StatefulSet back to the original number of replicas and wait until its state is Running. If it is a DaemonSet, allow its pod to run on the affected node again.
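
The exact commands for steps 1 and 5 depend on your environment. The following is an illustrative sketch that assumes a StatefulSet named <statefulset-name> in the <namespace> namespace with an original replica count of 3:

# Step 1: scale the affected StatefulSet down to 0 replicas
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 0

# Step 5: scale it back to the original number of replicas after the cleanup
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 3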


[18871] MySQL crashes during managed cluster update or instances live migration

Fixed in MOSK 22.2

MySQL may crash during instance live migration or a managed cluster update. After the crash, MariaDB cannot connect to the cluster and gets stuck in the CrashLoopBackOff state.

Workaround:

  1. Verify that other MariaDB replicas are up and running and have joined the cluster:

    1. Verify that at least 2 pods are running and operational (2/2 and Running):

      kubectl -n openstack get pods | grep maria
      

      Example of system response where the pods mariadb-server-0 and mariadb-server-2 are operational:

      mariadb-controller-77b5ff47d5-ndj68   1/1     Running     0          39m
      mariadb-server-0                      2/2     Running     0          39m
      mariadb-server-1                      0/2     Running     0          39m
      mariadb-server-2                      2/2     Running     0          39m
      
    2. Log in to each operational pod (for example, using kubectl exec as shown after this procedure) and verify that the node is Primary and that the cluster size is at least 2. For example:

      mysql -u root -p$MYSQL_DBADMIN_PASSWORD -e "show status;" | grep -e \
      wsrep_cluster_size -e "wsrep_cluster_status" -e "wsrep_local_state_comment"
      

      Example of system response:

      wsrep_cluster_size          2
      wsrep_cluster_status        Primary
      wsrep_local_state_comment   Synced
      
  2. Remove the content of the /var/lib/mysql directory on the affected replica:

    kubectl -n openstack exec -it mariadb-server-1 -- sh -c 'rm -rf /var/lib/mysql/*'
    
  3. Restart the MariaDB container:

    kubectl -n openstack delete pod mariadb-server-1
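
The following commands illustrate how to log in to an operational replica for the checks in step 1 and how to watch the restarted replica until it becomes operational again; the mariadb container name is an assumption and may differ in your deployment:

# Open a shell in an operational replica to run the wsrep checks
kubectl -n openstack exec -it mariadb-server-0 -c mariadb -- bash

# Watch the restarted replica until it reports 2/2 and Running
kubectl -n openstack get pod mariadb-server-1 -w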
    

[21998] OpenStack Controller may get stuck during the managed cluster update

Fixed in MOSK 22.2

During the MOSK cluster update, the OpenStack Controller may get stuck with the following symptoms:

  • Multiple nodemaintenancerequests exist:

    kubectl get nodemaintenancerequests
    
    NAME                                             AGE
    kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
    kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s
    
  • One nodemaintenancerequest has a deletion time stamp and an active openstack-controller finalizer (you can inspect the object as shown after this list):

    finalizers:
    - lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer
    
  • In the openstack-controller logs, retries are exhausted:

    2022-02-17 18:41:43,317 [ERROR] kopf._core.engines.peering: Request attempt #8 failed; will retry: PATCH https://10.232.0.1:443/apis/zalando.org/v1/namespaces/openstack/kopfpeerings/openstack-controller.nodemaintenancerequest -> APIServerError('Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', 'reason': 'InternalError', 'details': {'causes': [{'message': 'unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool'}]}, 'code': 500})
    2022-02-17 18:42:50,834 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:47:50,848 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:52:50,853 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:57:50,858 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 19:02:50,862 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    
  • The notification about a successful completion does not appear in the openstack-controller logs:

    kopf.objects: Handler 'node_maintenance_request_delete_handler' succeeded.
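
To inspect a particular nodemaintenancerequest for the deletion time stamp and the openstack-controller finalizer, you can, for example, run:

kubectl get nodemaintenancerequests <name> -o yaml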
    

As a workaround, delete the OpenStack Controller pod:

kubectl -n osh-system delete pod -l app.kubernetes.io/name=openstack-operator
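
After the pod is recreated, you can verify that the controller has processed the stale requests, for example, by checking that the pod is Running and that the outdated nodemaintenancerequests eventually disappear:

kubectl -n osh-system get pod -l app.kubernetes.io/name=openstack-operator
kubectl get nodemaintenancerequests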

[22321] Neutron back end may change from TF to ML2

An update of a MOSK cluster with Tungsten Fabric may hang because the Neutron back end unexpectedly changes from tungstenfabric to ml2, with the following symptoms:

  • The libvirt and nova-compute pods fail to start:

    Entrypoint WARNING: 2022/03/03 08:49:45 entrypoint.go:72:
    Resolving dependency Pod on same host with labels
    map[application:neutron component:neutron-ovs-agent] in namespace openstack failed:
    Found no pods matching labels: map[application:neutron component:neutron-ovs-agent] .
    
  • In the neutron section of the OSDPL resource, the ml2 back end is specified instead of tungstenfabric:

    spec:
      features:
        neutron:
          backend: ml2
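
    To check which back end is currently set, you can query the OSDPL object directly. The osdpl short name and the <osdpl-name> placeholder below are assumptions and may differ in your environment:

    kubectl -n openstack get osdpl <osdpl-name> -o jsonpath='{.spec.features.neutron.backend}'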
    

As a workaround, change the back end option from ml2 to tungstenfabric:

spec:
  features:
    neutron:
      backend: tungstenfabric
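
A change like this is typically applied by editing the OSDPL object, for example (the object name is a placeholder):

kubectl -n openstack edit osdpl <osdpl-name>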