Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 22.1.


[21790] Ceph cluster fails to update due to ‘csi-rbdplugin’ not found

A Ceph cluster fails to update on a managed cluster with the following message:

Failed to configure Ceph cluster: ceph cluster verification is failed:
[Daemonset csi-rbdplugin is not found]

As a workaround, restart the rook-ceph-operator pod:

kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
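
After the operator pod is restarted, you can verify that the missing DaemonSet has been recreated, for example:

kubectl -n rook-ceph get daemonset csi-rbdplugin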

[22725] Live migration may fail for instances with deleted images

Fixed in MOSK 22.2

During the update of a MOSK cluster to 22.1, live migration may fail for instances whose images were previously deleted. In this case, the nova-compute pod logs contain an error message similar to the following one:

2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Command: scp -C -r kaas-node-03ab613d-cf79-4830-ac70-ed735453481a:/var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8 /var/lib/nova/instances/_base/e2b6c1622d45071ec8a88a41d07ef785e4dfdfe8
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Exit code: 1
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stdout: ''
2022-03-22 23:55:24.468 11816 ERROR nova.compute.manager [instance: 128cf508-f7f7-4a40-b742-392c8c80fc7d] Stderr: 'ssh: Could not resolve hostname kaas-node-03ab613d-cf79-4830-ac70-ed735453481a: Name or service not known\r\n'

Workaround:

  • If you have not yet started the managed cluster update, change the nova-compute image by specifying the following values in the OpenStackDeployment CR:

    spec:
      services:
        compute:
          nova:
            values:
              images:
                tags:
                  nova_compute: mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700
    
  • If you have already started the managed cluster update, manually update the nova-compute container image in the nova-compute DaemonSet to mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700.
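
    If the nova-compute DaemonSet and its container are both named nova-compute and reside in the openstack namespace (an assumption; verify the actual names in your environment first), the image can be updated as follows:

    # List the container names of the DaemonSet to confirm which container to update
    kubectl -n openstack get daemonset nova-compute -o jsonpath='{.spec.template.spec.containers[*].name}'
    # Update the image of the nova-compute container
    kubectl -n openstack set image daemonset/nova-compute nova-compute=mirantis.azurecr.io/openstack/nova:victoria-bionic-20220324125700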


[16987] Cluster update fails at Ceph CSI pod eviction

Fixed in MOSK 22.2

An update of a MOSK cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.

Workaround:

  1. Scale the StatefulSet of the affected pod that fails to init down to 0 replicas. If the pod belongs to a DaemonSet such as nova-compute, make sure that it is not scheduled on the affected node. See the scaling example after this procedure.

  2. On every csi-rbdplugin pod, search for stuck csi-vol:

    rbd device list | grep <csi-vol-uuid>
    
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    
  4. Delete the volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  5. Scale the affected StatefulSet back to the original number of replicas and wait until its state is Running. If it is a DaemonSet, allow the pod to be scheduled on the affected node again.
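
For steps 1 and 5, you can perform the scaling with kubectl. The following is a sketch; substitute the namespace, StatefulSet name, and original replica count of the affected workload:

# Step 1: scale the affected StatefulSet down to 0 replicas
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas 0
# Step 5: scale it back after completing steps 2-4
kubectl -n <namespace> scale statefulset <statefulset-name> --replicas <original-replicas>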


[18871] MySQL crashes during managed cluster update or instances live migration

Fixed in MOSK 22.2

MySQL may crash during instance live migration or a managed cluster update. After the crash, MariaDB cannot connect to the cluster and gets stuck in the CrashLoopBackOff state.

Workaround:

  1. Verify that other MariaDB replicas are up and running and have joined the cluster:

    1. Verify that at least 2 pods are running and operational (2/2 and Running):

      kubectl -n openstack get pods |grep maria
      

      Example of system response where the pods mariadb-server-0 and mariadb-server-2 are operational:

      mariadb-controller-77b5ff47d5-ndj68   1/1     Running     0          39m
      mariadb-server-0                      2/2     Running     0          39m
      mariadb-server-1                      0/2     Running     0          39m
      mariadb-server-2                      2/2     Running     0          39m
      
    2. Log in to each operational pod and verify that the cluster status is Primary and the cluster size is at least 2. For example:

      mysql -u root -p$MYSQL_DBADMIN_PASSWORD -e "show status;" |grep -e \
      wsrep_cluster_size -e "wsrep_cluster_status" -e "wsrep_local_state_comment"
      

      Example of system response:

      wsrep_cluster_size          2
      wsrep_cluster_status        Primary
      wsrep_local_state_comment   Synced
      
  2. Remove the content of the /var/lib/mysql directory on the affected replica:

    kubectl -n openstack exec -it mariadb-server-1 -- sh -c 'rm -rf /var/lib/mysql/*'
    
  3. Restart the affected MariaDB pod:

    kubectl -n openstack delete pod mariadb-server-1
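
After the pod is recreated, verify that the replica rejoins the cluster by repeating the checks from step 1, for example:

kubectl -n openstack get pods | grep maria

The restarted pod should eventually report 2/2 and Running, and the wsrep_cluster_size value should increase accordingly.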
    

[21998] OpenStack Controller may get stuck during the managed cluster update

Fixed in MOSK 22.2

During the MOSK cluster update, the OpenStack Controller may get stuck with the following symptoms:

  • Multiple nodemaintenancerequests exist:

    kubectl get nodemaintenancerequests
    
    NAME                                             AGE
    kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
    kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s
    
  • One nodemaintenancerequest has a DeletedAt time stamp and an active openstack-controller finalizer:

    finalizers:
    - lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer
    
  • In the openstack-controller logs, retries are exhausted:

    2022-02-17 18:41:43,317 [ERROR] kopf._core.engines.peering: Request attempt #8 failed; will retry: PATCH https://10.232.0.1:443/apis/zalando.org/v1/namespaces/openstack/kopfpeerings/openstack-controller.nodemaintenancerequest -> APIServerError('Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool', 'reason': 'InternalError', 'details': {'causes': [{'message': 'unable to unmarshal response in forceLegacy: json: cannot unmarshal number into Go value of type bool'}]}, 'code': 500})
    2022-02-17 18:42:50,834 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:47:50,848 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:52:50,853 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 18:57:50,858 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    2022-02-17 19:02:50,862 [INFO] kopf.objects: Timer 'heartbeat' succeeded.
    
  • The logs contain no notification about a successful finish:

    kopf.objects: Handler 'node_maintenance_request_delete_handler' succeeded.
    

As a workaround, delete the OpenStack Controller pod:

kubectl -n osh-system delete pod -l app.kubernetes.io/name=openstack-operator
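
After the OpenStack Controller pod is rescheduled, verify that the stuck nodemaintenancerequest has been processed, for example (replace <name> with the name of the affected object):

kubectl get nodemaintenancerequests
kubectl get nodemaintenancerequests <name> -o yaml

The affected object should eventually be removed, or its metadata.finalizers field should no longer contain the lcm.mirantis.com/openstack-controller.nodemaintenancerequest-finalizer entry.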

[22321] Neutron backend may change from TF to ML2

An update of a MOSK cluster with Tungsten Fabric may hang because the Neutron backend unexpectedly changes from tungstenfabric to ml2, with the following symptoms:

  • The libvirt and nova-compute pods fail to start:

    Entrypoint WARNING: 2022/03/03 08:49:45 entrypoint.go:72:
    Resolving dependency Pod on same host with labels
    map[application:neutron component:neutron-ovs-agent] in namespace openstack failed:
    Found no pods matching labels: map[application:neutron component:neutron-ovs-agent] .
    
  • In the neutron section of the OSDPL resource, the ml2 backend is specified instead of tungstenfabric:

    spec:
      features:
        neutron:
          backend: ml2
    

As a workaround, change the backend option from ml2 to tungstenfabric:

spec:
  features:
    neutron:
      backend: tungstenfabric
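
To apply the change, you can edit the OpenStackDeployment resource directly. A minimal sketch, assuming the resource resides in the openstack namespace; substitute the name of your OpenStackDeployment object:

kubectl -n openstack edit openstackdeployment <osdpl-name>

Set spec.features.neutron.backend back to tungstenfabric as shown above and save the changes.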