Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 22.1.


[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found

A Ceph cluster fails to update on a managed cluster with the following message:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[Daemonset csi-rbdplugin is not found]

As a workaround, restart the rook-ceph-operator pod:

kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
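
After the operator pod is recreated, you can verify that the csi-rbdplugin DaemonSet reported in the error is present again, for example:

kubectl -n rook-ceph get daemonset csi-rbdplugin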

[22725] Live migration may fail for instances with deleted images

Fixed in MOSK 22.2

During the update of a MOSK cluster to 22.1, live migration may fail for instances if their images were previously deleted. In this case, the nova-compute pod logs contain an error message similar to the following one:

2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Command: scp -C -r kaas-node-03ab613d-cf79-4830-ac70-ed735453481a:/var/l
ib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8 /var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Exit code: 1
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stdout: ''
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stderr: 'ssh: Could not resolve hostname kaas-node-03ab613d-cf79-4830-ac
70-ed735453481a: Name or service not known\r\n'

Workaround:

  • If you have not yet started the managed cluster update, change the nova-compute image by setting the following image tag in the OpenStackDeployment CR:

    spec:
      services:
        compute:
          nova:
            values:
              images:
                tags:
                  nova_compute: mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700
    
  • If you have already started the managed cluster update, manually update the nova-compute container image in the nova-compute DaemonSet to mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700.
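
    For example, assuming that both the DaemonSet and its container are named nova-compute and reside in the openstack namespace (the names may differ in your deployment), the image can be updated with a command similar to:

    kubectl -n openstack set image daemonset/nova-compute \
      nova-compute=mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700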


[16987] Cluster update fails at Ceph CSI pod eviction

Fixed in MOSK 22.2

An update of a MOSK cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.

Workaround:

  1. Scale the StatefulSet of the affected pod that fails to initialize down to 0 replicas. If the affected workload is a DaemonSet, such as nova-compute, make sure that it is not scheduled on the affected node. Example scale commands are provided after this procedure.

  2. On every csi-rbdplugin pod, search for the stuck csi-vol device:

    rbd device list | grep <csi-vol-uuid>
    
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    
  4. Delete the VolumeAttachment object of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  5. Scale the affected StatefulSet back to the original number of replicas and wait until its state is Running. If it is a DaemonSet, allow its pod to run on the affected node again.
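
The exact commands for steps 1 and 5 depend on your environment. The following is an illustrative sketch that assumes a StatefulSet named <statefulset-name> in the <namespace> namespace with an original replica count of 3:

# Step 1: scale the affected StatefulSet down to 0 replicas
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 0

# Step 5: scale it back to the original number of replicas after the cleanup
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 3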


[18871] MySQL crashes during managed cluster update or instances live migration

Fixed in MOSK 22.2

MySQL may crash during instance live migration or a managed cluster update. After the crash, MariaDB cannot connect to the cluster and gets stuck in the CrashLoopBackOff state.

Workaround:

  1. Verify that other MariaDB replicas are up and running and have joined the cluster:

    1. Verify that at least 2 pods are running and operational (2/2 and Running):

      kubectl -n openstack get pods | grep maria
      

      Example of system response where the pods mariadb-server-0 and mariadb-server-2 are operational:

      mariadb-controller-77b5ff47d5-ndj68   1/1     Running     0          39m
      mariadb-server-0                      2/2     Running     0          39m
      mariadb-server-1                      0/2     Running     0          39m
      mariadb-server-2                      2/2     Running     0          39m
      
    2. Log in to each operational pod (for example, using kubectl exec as shown after this procedure) and verify that the node is Primary and that the cluster size is at least 2. For example:

      mysql -u root -p$MYSQL_DBADMIN_PASSWORD -e "show status;" | grep -e \
      wsrep_cluster_size -e "wsrep_cluster_status" -e "wsrep_local_state_comment"
      

      Example of system response:

      wsrep_cluster_size          2
      wsrep_cluster_status        Primary
      wsrep_local_state_comment   Synced
      
  2. Remove the content of the /var/lib/mysql directory on the affected replica:

    kubectl -n openstack exec -it mariadb-server-1 -- sh -c 'rm -rf /var/lib/mysql/*'
    
  3. Restart the MariaDB container:

    kubectl -n openstack delete pod mariadb-server-1
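
The following commands illustrate how to log in to an operational replica for the checks in step 1 and how to watch the restarted replica until it becomes operational again; the mariadb container name is an assumption and may differ in your deployment:

# Open a shell in an operational replica to run the wsrep checks
kubectl -n openstack exec -it mariadb-server-0 -c mariadb -- bash

# Watch the restarted replica until it reports 2/2 and Running
kubectl -n openstack get pod mariadb-server-1 -w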
    

[21998] OpenStack Controller may get stuck during the managed cluster update

Fixed in MOSK 22.2

During the MOSK cluster update, the OpenStack Controller may get stuck with the following symptoms:

  • Multiple nodemaintenancerequests exist:

    kubectl get nodemaintenancerequests
    
    NAME                                             AGE
    kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
    kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s
    
  • One nodemaintenancerequest has a deletion time stamp and an active openstack-controller finalizer (you can inspect the object as shown after this list):

    finalizers:
    - lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer
    
  • In the openstack-controller logs, retries are exhausted:

    2022-02-17 18:41:43,317 [ERROR] kopf._core.engines.peering: Request attempt #8 failed; will retry: PATCH https://10.232.0.1:443/apis/zalando.org/v1/namespaces/openstack/kopfpeerings/openstack-controller.nodemaintenancerequest -> APIServerError('Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', 'reason': 'InternalError', 'details': {'causes': [{'message': 'unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool'}]}, 'code': 500})
    2022-02-17 18:42:50,834 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:47:50,848 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:52:50,853 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:57:50,858 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 19:02:50,862 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    
  • The notification about a successful completion does not appear in the openstack-controller logs:

    kopf.objects: Handler 'node_maintenance_request_delete_handler' succeeded.
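
To inspect a particular nodemaintenancerequest for the deletion time stamp and the openstack-controller finalizer, you can, for example, run:

kubectl get nodemaintenancerequests <name> -o yaml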
    

As a workaround, delete the OpenStack Controller pod:

kubectl -n osh-system delete pod -l app.kubernetes.io/name=openstack-operator
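
After the pod is recreated, you can verify that the controller has processed the stale requests, for example, by checking that the pod is Running and that the outdated nodemaintenancerequests eventually disappear:

kubectl -n osh-system get pod -l app.kubernetes.io/name=openstack-operator
kubectl get nodemaintenancerequests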

[22321] Neutron back end may change from TF to ML2

An update of a MOSK cluster with Tungsten Fabric may hang because the Neutron back end unexpectedly changes from tungstenfabric to ml2, with the following symptoms:

  • The libvirt and nova-compute pods fail to start:

    Entrypoint WARNING: 2022/03/03 08:49:45 entrypoint.go:72:
    Resolving dependency Pod on same host with labels
    map[application:neutron component:neutron-ovs-agent] in namespace openstack failed:
    Found no pods matching labels: map[application:neutron component:neutron-ovs-agent] .
    
  • In the neutron section of the OSDPL resource, the ml2 back end is specified instead of tungstenfabric:

    spec:
      features:
        neutron:
          backend: ml2
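
    To check which back end is currently set, you can query the OSDPL object directly. The osdpl short name and the <osdpl-name> placeholder below are assumptions and may differ in your environment:

    kubectl -n openstack get osdpl <osdpl-name> -o jsonpath='{.spec.features.neutron.backend}'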
    

As a workaround, change the back end option from ml2 to tungstenfabric:

spec:
  features:
    neutron:
      backend: tungstenfabric
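
A change like this is typically applied by editing the OSDPL object, for example (the object name is a placeholder):

kubectl -n openstack edit osdpl <osdpl-name>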