Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 21.4.


[17477] StackLight in HA mode is not deployed or cluster update is blocked

Fixed in MOS 21.5

New managed clusters deployed using the Cluster release 6.18.0 with StackLight enabled in the HA mode on the control plane nodes do not get StackLight deployed. The update of existing clusters with such a StackLight configuration that were created using the Cluster release 6.16.0 is blocked with the following error message:

cluster release version upgrade is forbidden: \
Minimum number of worker machines with StackLight label is 3

Workaround:

  1. On the affected managed cluster:

    1. Create a key-value pair that will be used as a unique label on the cluster nodes. In our example, it is forcedRole: stacklight.

      To verify the label names that already exist on the cluster nodes:

      kubectl get nodes --show-labels
      
    2. Add the new label to the target nodes for StackLight. For example, to the Kubernetes master nodes:

      kubectl label nodes --selector=node-role.kubernetes.io/master forcedRole=stacklight
      
    3. Verify that the new label is added:

      kubectl get nodes --show-labels
      
  2. On the related management cluster:

    1. Configure nodeSelector for the StackLight components by modifying the affected Cluster object:

      kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>
      

      For example:

      spec:
        ...
        providerSpec:
          ...
          value:
            ...
            helmReleases:
              ...
              - name: stacklight
                values:
                  ...
                  nodeSelector:
                    default:
                      forcedRole: stacklight
      
    2. Select from the following options:

      • If you faced the issue during a managed cluster deployment, skip this step.

      • If you faced the issue during a managed cluster update, wait until all StackLight component resources are recreated on the target nodes with the updated node selectors.

        To monitor the cluster status:

        kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' | jq
        

        In the cluster status, verify that the elasticsearch-master and prometheus-server resources are ready. The process can take up to 30 minutes.

        Example of a negative system response:

        {
          "message": "not ready: statefulSets: stacklight/elasticsearch-master got 2/3 replicas",
          "ready": false,
          "type": "StackLight"
        }
        
  3. In the Container Cloud web UI, add a fake StackLight label to any 3 worker nodes to satisfy the deployment requirement as described in Mirantis Container Cloud Operations Guide: Create a machine using web UI. StackLight will still be placed on the target nodes that have the forcedRole: stacklight label.
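
    Optionally, verify on the managed cluster that the three worker nodes received the StackLight label added through the web UI. The exact label key assigned by the web UI is not shown here, so inspect the node labels directly:

      kubectl get nodes --show-labels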

    Once done, the StackLight deployment or update proceeds.


[17305] Cluster update fails with the ‘Not ready releases: descheduler’ error

Affects only MOS 21.4

An update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0 may fail with an error message similar to the following:

Cluster data status: conditions:
- message: 'Helm charts are not installed(upgraded) yet. Not ready releases: descheduler.'
  ready: false
  type: Helm

The issue may affect the descheduler and metrics-server Helm releases.

As a workaround, run helm uninstall descheduler or helm uninstall metrics-server and wait for the Helm controller to recreate the affected release.
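
For example, a minimal sketch of the workaround, assuming kubectl and helm access to the affected managed cluster; the condition check follows the same jsonpath pattern as the StackLight status check in the previous known issue and assumes that the Helm condition appears in the same conditions list:

# Remove the failed release; use metrics-server if that is the release reported as not ready
helm uninstall descheduler

# From the management cluster, monitor the Helm condition until it reports ready: true
kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="Helm")]}' | jq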


[16987] Cluster update fails at Ceph CSI pod eviction

An update of a MOS managed cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.

Workaround:

  1. Scale the StatefulSet of the affected pod that fails to initialize down to 0 replicas. If the pod belongs to a DaemonSet, such as nova-compute, make sure that the DaemonSet is not scheduled on the affected node.
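
    A minimal sketch of the scale-down, assuming the affected StatefulSet is <statefulSetName> in the <namespace> namespace (both placeholders are illustrative):

    kubectl -n <namespace> scale statefulset <statefulSetName> --replicas=0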

  2. On every csi-rbdplugin pod, search for the stuck csi-vol:

    rbd device list | grep <csi-vol-uuid>
    
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    
  4. Delete the VolumeAttachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  5. Scale the affected StatefulSet back to the original number of replicas and wait until its state is Running. If it is a DaemonSet, schedule the pod on the affected node again.
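
    A minimal sketch, using the same illustrative placeholders as in step 1:

    kubectl -n <namespace> scale statefulset <statefulSetName> --replicas=<originalReplicas>
    kubectl -n <namespace> get pods -w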


[17115] Cluster update does not change releaseRefs in Cluster object for Ceph

Fixed in MOS 21.5

During the MOS managed cluster update from the Cluster release 6.16.0 to 6.18.0, the status.providerStatus.releaseRefs.previous.name field in the Cluster object does not change.

Workaround:

  1. In the clusterworkloadlock CRD, remove the subresources section:

    kubectl edit crd clusterworkloadlocks.lcm.mirantis.com
    # remove the following 'subresources' section:
    spec:
       versions:
       - name: v1alpha1
         subresources:
           status: {}
    
  2. Obtain clusterRelease from the ceph-controller settings ConfigMap:

    kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.clusterRelease}'
    
  3. Create a ceph-cwl.yaml file with Ceph ClusterWorkloadLock:

    apiVersion: lcm.mirantis.com/v1alpha1
    kind: ClusterWorkloadLock
    metadata:
      name: ceph-clusterworkloadlock
    spec:
      controllerName: ceph
    status:
      state: inactive
      release: <clusterRelease> # from the previous step
    

    Substitute <clusterRelease> with clusterRelease obtained in the previous step.

  4. Apply the resource:

    kubectl apply -f ceph-cwl.yaml
    
  5. Verify that the lock has been created:

    kubectl get clusterworkloadlock ceph-clusterworkloadlock -o yaml
    

[17038] Cluster update may fail with TimeoutError

Affects only MOS 21.4

A MOS managed cluster update from the Cluster release 6.16.0 to 6.18.0 may fail with the Timeout waiting for pods statuses error. The error means that the pod containers are not ready and frequently restart with OOMKilled as the restart reason. For example:

kubectl describe pod prometheus-server-0 -n stacklight
...
Containers:
  ...
  prometheus-server:
    ...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 16 Aug 2021 12:47:57 +0400
      Finished:     Mon, 16 Aug 2021 12:58:02 +0400
...

Workaround:

  1. In the Cluster object, set clusterSize to medium as described in Mirantis Container Cloud Operations Guide: StackLight configuration parameters.
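
    A sketch of where the parameter goes, following the same helmReleases structure as the nodeSelector example in the known issue 17477 above (surrounding fields are abbreviated):

    kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>

    spec:
      providerSpec:
        value:
          helmReleases:
            ...
            - name: stacklight
              values:
                ...
                clusterSize: medium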

  2. Wait until the updated resource limits propagate to the prometheus-server StatefulSet object.
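
    For example, to check the limits currently set on the prometheus-server container of the StatefulSet (the container name is taken from the pod description above):

    kubectl -n stacklight get statefulset prometheus-server -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus-server")].resources}'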

  3. Delete the affected prometheus-server pods. For example:

    kubectl delete pods prometheus-server-0 prometheus-server-1 -n stacklight
    

Once done, new pods with updated resource limits will be created automatically.


[15525] HelmBundle controller gets stuck during cluster update

The HelmBundle controller that handles OpenStack releases gets stuck during the cluster update and does not apply HelmBundle changes. The issue is caused by an unlimited release history that increases the amount of RAM consumed by Tiller. The workaround is to manually limit the release history to 3 entries.

Workaround:

  1. Remove the old releases:

    1. Clean up releases in the stacklight namespace:

      # Delete all Tiller release ConfigMaps that match the pattern,
      # keeping only the latest N releases (3 by default)
      function cleanup_release_history {
         pattern=$1
         left_items=${2:-3}
         for i in $(kubectl -n stacklight get cm | grep "$pattern" | awk '{print $1}' | sort -V | head -n -${left_items})
         do
           kubectl -n stacklight delete cm $i
         done
      }
      

      For example:

      kubectl -n stacklight get cm |grep "openstack-cinder.v" | awk '{print $1}'
      openstack-cinder.v1
      ...
      openstack-cinder.v50
      openstack-cinder.v51
      cleanup_release_history openstack-cinder.v
      
  2. Fix the releases in the FAILED state:

    1. Connect to one of the StackLight Helm controller pods and list the releases in the FAILED state:

      kubectl -n stacklight exec -it stacklight-helm-controller-699cc6949-dtfgr -- sh
      ./helm --host localhost:44134 list
      

      Example of system response:

      # openstack-heat            2313   Wed Jun 23 06:50:55 2021   FAILED   heat-0.1.0-mcp-3860      openstack
      # openstack-keystone        76     Sun Jun 20 22:47:50 2021   FAILED   keystone-0.1.0-mcp-3860  openstack
      # openstack-neutron         147    Wed Jun 23 07:00:37 2021   FAILED   neutron-0.1.0-mcp-3860   openstack
      # openstack-nova            1      Wed Jun 23 07:09:43 2021   FAILED   nova-0.1.0-mcp-3860      openstack
      # openstack-nova-rabbitmq   15     Wed Jun 23 07:04:38 2021   FAILED   rabbitmq-0.1.0-mcp-2728  openstack
      
    2. Determine the reason for the release failure. Typically, it is caused by changes in immutable objects (jobs). For example:

      ./helm --host localhost:44134 history openstack-mariadb
      

      Example of system response:

      REVISION   UPDATED                    STATUS     CHART                   APP VERSION   DESCRIPTION
      173        Thu Jun 17 20:26:14 2021   DEPLOYED   mariadb-0.1.0-mcp-2710                Upgrade complete
      212        Wed Jun 23 07:07:58 2021   FAILED     mariadb-0.1.0-mcp-2728                Upgrade "openstack-mariadb" failed: Job.batch "openstack-...
      213        Wed Jun 23 07:55:22 2021   FAILED     mariadb-0.1.0-mcp-2728                Upgrade "openstack-mariadb" failed: Job.batch "exporter-c...
      
    3. Remove the FAILED job and roll back the release. For example:

      kubectl -n openstack delete job -l application=mariadb
      ./helm --host localhost:44134 rollback openstack-mariadb 213
      
    4. Verify that the release is in the DEPLOYED state. For example:

      ./helm --host localhost:44134 history openstack-mariadb
      
    5. Perform the steps above for all releases in the FAILED state one by one.

  3. Set TILLER_HISTORY_MAX to 3 in the stacklight-helm-controller deployment:

    kubectl -n stacklight edit deployment stacklight-helm-controller
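
    In the deployment manifest, set the variable in the env section of the controller container, for example (the container name placeholder below is illustrative; use the one present in your deployment):

    spec:
      template:
        spec:
          containers:
          - name: <controllerContainerName>
            env:
            - name: TILLER_HISTORY_MAX
              value: "3"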