Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 21.4.


[17477] StackLight in HA mode is not deployed or cluster update is blocked

Fixed in MOS 21.5

New managed clusters deployed using the Cluster release 6.18.0 with StackLight enabled in the HA mode on control plane nodes do not have StackLight deployed. The update of existing clusters with such a StackLight configuration that were created using the Cluster release 6.16.0 is blocked with the following error message:

cluster release version upgrade is forbidden: \
Minimum number of worker machines with StackLight label is 3

Workaround:

  1. On the affected managed cluster:

    1. Create a key-value pair that will be used as a unique label on the cluster nodes. In our example, it is forcedRole: stacklight.

      To verify the label names that already exist on the cluster nodes:

      kubectl get nodes --show-labels
      
    2. Add the new label to the target nodes for StackLight. For example, to the Kubernetes master nodes:

      kubectl label nodes --selector=node-role.kubernetes.io/master forcedRole=stacklight
      
    3. Verify that the new label is added:

      kubectl get nodes --show-labels
      
  2. On the related management cluster:

    1. Configure nodeSelector for the StackLight components by modifying the affected Cluster object:

      kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>
      

      For example:

      spec:
        ...
        providerSpec:
          ...
          value:
            ...
            helmReleases:
              ...
              - name: stacklight
                values:
                  ...
                  nodeSelector:
                    default:
                      forcedRole: stacklight
      
    2. Select from the following options:

      • If you faced the issue during a managed cluster deployment, skip this step.

      • If you faced the issue during a managed cluster update, wait until all StackLight component resources are recreated on the target nodes with the updated node selectors.

        To monitor the cluster status:

        kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' | jq
        

        In the cluster status, verify that the elasticsearch-master and prometheus-server resources are ready. The process can take up to 30 minutes.

        Example of a negative system response:

        {
          "message": "not ready: statefulSets: stacklight/elasticsearch-master got 2/3 replicas",
          "ready": false,
          "type": "StackLight"
        }
        
  3. In the Container Cloud web UI, add a fake StackLight label to any 3 worker nodes to satisfy the deployment requirement as described in Mirantis Container Cloud Operations Guide: Create a machine using web UI. Eventually, StackLight will still be placed on the target nodes with the forcedRole: stacklight label.

    Once done, the StackLight deployment or update proceeds.
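
    For example, to confirm that StackLight pods were eventually scheduled on the nodes carrying the forcedRole: stacklight label, you can list the labeled nodes and the pod placement:

      kubectl get nodes -l forcedRole=stacklight
      kubectl get pods -n stacklight -o wide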


[17305] Cluster update fails with the ‘Not ready releases: descheduler’ error

Affects only MOS 21.4

An update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0 may fail with an error message similar to the following:

Cluster data status: conditions:
- message: 'Helm charts are not installed(upgraded) yet. Not ready releases: descheduler.'
  ready: false
  type: Helm

The issue may affect the descheduler and metrics-server Helm releases.

As a workaround, run helm uninstall descheduler or helm uninstall metrics-server and wait for Helm Controller to recreate the affected release.
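
For example, a possible command sequence is as follows. This is a sketch only: the release namespace is an assumption, so identify the actual namespace of the affected release first.

# Locate the affected release and its namespace
helm ls -A | grep -E 'descheduler|metrics-server'

# Remove the failed release; Helm Controller recreates it automatically
helm uninstall descheduler -n <releaseNamespace>

# Verify that the release is recreated and reaches the deployed status
helm ls -n <releaseNamespace> | grep descheduler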


[16987] Cluster update fails at Ceph CSI pod eviction

Fixed in MOS 22.2

An update of a MOS cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.

Workaround:

  1. Scale the StatefulSet of the affected pod that fails to initialize down to 0 replicas. If it is a DaemonSet, such as nova-compute, make sure it is not scheduled on the affected node. See the command sketch after this procedure.

  2. On every csi-rbdplugin pod, search for the stuck csi-vol:

    rbd device list | grep <csi-vol-uuid>
    
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    
  4. Delete the volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  5. Scale the affected StatefulSet back to the original number of replicas or until its state is Running. If it is a DaemonSet, run the pod on the affected node again.
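
The following command sketch summarizes the procedure above. All names are placeholders: substitute the StatefulSet name and its namespace, the Ceph CSI namespace and pod label (rook-ceph and app=csi-rbdplugin are assumptions), and the csi-vol UUID with the values from your environment.

# Step 1: scale down the StatefulSet of the affected pod
kubectl -n <namespace> scale statefulset <affected-statefulset> --replicas=0

# Steps 2-3: in each csi-rbdplugin pod, find and force-unmap the stuck csi-vol
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o name
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
kubectl -n rook-ceph exec <csi-rbdplugin-pod> -c csi-rbdplugin -- rbd unmap -o force /dev/rbd<i>

# Step 4: delete the volumeattachment of the affected pod
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>

# Step 5: scale the StatefulSet back to the original number of replicas
kubectl -n <namespace> scale statefulset <affected-statefulset> --replicas=<originalReplicas>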


[17115] Cluster update does not change releaseRefs in Cluster object for Ceph

Fixed in MOS 21.5

During an update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0, the status.providerStatus.releaseRefs.previous.name field in the Cluster object does not change.

Workaround:

  1. In the clusterworkloadlock CRD, remove the subresources section:

    kubectl edit crd clusterworkloadlocks.lcm.mirantis.com
    # remove the following 'subresources' section:
    spec:
       versions:
       - name: v1alpha1
         subresources:
           status: {}
    
  2. Obtain clusterRelease from the ceph-controller settings ConfigMap:

    kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.clusterRelease}'
    
  3. Create a ceph-cwl.yaml file with Ceph ClusterWorkloadLock:

    apiVersion: lcm.mirantis.com/v1alpha1
    kind: ClusterWorkloadLock
    metadata:
      name: ceph-clusterworkloadlock
    spec:
      controllerName: ceph
    status:
      state: inactive
      release: <clusterRelease> # from the previous step
    

    Substitute <clusterRelease> with the clusterRelease value obtained in the previous step.

  4. Apply the resource:

    kubectl apply -f ceph-cwl.yaml
    
  5. Verify that the lock has been created:

    kubectl get clusterworkloadlock ceph-clusterworkloadlock -o yaml
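
  Once the update completes, you can verify on the related management cluster that the field affected by this issue has changed, for example:

    kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> \
      -o jsonpath='{.status.providerStatus.releaseRefs.previous.name}'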
    

[17038] Cluster update may fail with TimeoutError

Affects only MOS 21.4

A MOS cluster update from the Cluster release 6.16.0 to 6.18.0 may fail with the Timeout waiting for pods statuses error. The error means that pod containers are not ready and often restart with OOMKilled as the restart reason. For example:

kubectl describe pod prometheus-server-0 -n stacklight
...
Containers:
  ...
  prometheus-server:
    ...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 16 Aug 2021 12:47:57 +0400
      Finished:     Mon, 16 Aug 2021 12:58:02 +0400
...

Workaround:

  1. In the Cluster object, set clusterSize to medium as described in Mirantis Container Cloud Operations Guide: StackLight configuration parameters. See the configuration sketch after this procedure.

  2. Wait until the updated resource limits propagate to the prometheus-server StatefulSet object.

  3. Delete the affected prometheus-server pods. For example:

    kubectl delete pods prometheus-server-0 prometheus-server-1 -n stacklight
    

Once done, new pods with updated resource limits will be created automatically.
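
The clusterSize parameter from step 1 is set in the StackLight Helm release values in the Cluster object, following the same helmReleases structure as the nodeSelector example earlier in this section. A minimal sketch is shown below; verify the exact placement in the referenced Operations Guide:

spec:
  ...
  providerSpec:
    ...
    value:
      ...
      helmReleases:
        ...
        - name: stacklight
          values:
            ...
            clusterSize: medium

To check that the updated resource limits reached the prometheus-server StatefulSet before deleting the pods (step 2), you can, for example, inspect the container resources:

kubectl -n stacklight get statefulset prometheus-server \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus-server")].resources}'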