Cluster update known issues

This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 21.4.

[17477] StackLight in HA mode is not deployed or cluster update is blocked

Fixed in MOS 21.5

New managed clusters deployed using the Cluster release 6.18.0 with StackLight enabled in the HA mode on the control plane nodes do not have StackLight deployed. The update of existing clusters with such a StackLight configuration that were created using the Cluster release 6.16.0 is blocked with the following error message:

cluster release version upgrade is forbidden: \
Minimum number of worker machines with StackLight label is 3


Workaround:

  1. On the affected managed cluster:

    1. Create a key-value pair that will be used as a unique label on the cluster nodes. In our example, it is forcedRole: stacklight.

      To verify the label names that already exist on the cluster nodes:

      kubectl get nodes --show-labels
    2. Add the new label to the target nodes for StackLight. For example, to the Kubernetes master nodes:

      kubectl label nodes <nodeName1> <nodeName2> <nodeName3> forcedRole=stacklight
    3. Verify that the new label is added:

      kubectl get nodes --show-labels
  2. On the related management cluster:

    1. Configure nodeSelector for the StackLight components by modifying the affected Cluster object:

      kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>

      For example:

              - name: stacklight
                values:
                  nodeSelector:
                    default:
                      forcedRole: stacklight
    2. Select from the following options:

      • If you faced the issue during a managed cluster deployment, skip this step.

      • If you faced the issue during a managed cluster update, wait until all StackLight component resources are recreated on the target nodes with the updated node selectors.

        To monitor the cluster status:

        kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' | jq

        In the cluster status, verify that the elasticsearch-master and prometheus-server resources are ready. The process can take up to 30 minutes.

        Example of a negative system response:

          "message": "not ready: statefulSets: stacklight/elasticsearch-master got 2/3 replicas",
          "ready": false,
          "type": "StackLight"
  3. In the Container Cloud web UI, add a fake StackLight label to any 3 worker nodes to satisfy the deployment requirement as described in Mirantis Container Cloud Operations Guide: Create a machine using web UI. StackLight will eventually still be placed on the target nodes that have the forcedRole: stacklight label.

    Once done, the StackLight deployment or update proceeds.
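The readiness check in step 2 can be scripted. Below is a minimal sketch, assuming the condition JSON is piped in from the kubectl jsonpath query shown above; the stacklight_ready function name is illustrative, not part of the product.

```shell
#!/bin/sh
# Sketch: decide whether the StackLight condition reports readiness.
# Reads the condition JSON (as produced by the jsonpath query above) on stdin.
stacklight_ready() {
  grep -Eq '"ready":[[:space:]]*true'
}

# Example polling loop (cluster and project names are placeholders):
# until kubectl get cluster <affectedManagedClusterName> \
#     -n <affectedManagedClusterProjectName> \
#     -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' \
#     | stacklight_ready; do
#   sleep 60   # the process can take up to 30 minutes
# done
```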

[17305] Cluster update fails with the ‘Not ready releases: descheduler’ error

Affects only MOS 21.4

An update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0 may fail with an error message similar to the following:

Cluster data status: conditions:
- message: 'Helm charts are not installed(upgraded) yet. Not ready releases: descheduler.'
  ready: false
  type: Helm

The issue may affect the descheduler and metrics-server Helm releases.

As a workaround, run helm uninstall descheduler or helm uninstall metrics-server for the affected release and wait until Helm Controller recreates it.
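Waiting for the release to reappear can be automated with a generic retry loop. This is an illustrative helper, not part of the product CLI; the wait_for name and the timeout values are assumptions.

```shell
#!/bin/sh
# Sketch: run a command repeatedly until it succeeds or a timeout (seconds) expires.
wait_for() {
  timeout=$1; shift
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 5
    elapsed=$((elapsed + 5))
  done
  return 0
}

# Usage after 'helm uninstall descheduler':
# wait_for 600 sh -c 'helm status descheduler >/dev/null 2>&1'
```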

[16987] Cluster update fails at Ceph CSI pod eviction

Fixed in MOS 22.2

An update of a MOS cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.


Workaround:

  1. Scale the StatefulSet of the affected pod that fails to initialize down to 0 replicas. If the pod belongs to a DaemonSet, such as nova-compute, make sure it is not scheduled on the affected node.

  2. In every csi-rbdplugin pod, search for the stuck csi-vol:

    rbd device list | grep <csi-vol-uuid>
  3. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
  4. Delete the VolumeAttachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
  5. Scale the affected StatefulSet back to the original number of replicas or until its state is Running. If it is a DaemonSet, run the pod on the affected node again.
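Step 4 above can be sketched as a small filter over the kubectl output. The va_name_for_uuid helper and the sample column layout are illustrative assumptions; verify the output format of kubectl get volumeattachments on your cluster before piping to delete.

```shell
#!/bin/sh
# Sketch: extract the VolumeAttachment name (first column) for a given csi-vol UUID.
# Reads 'kubectl get volumeattachments' output on stdin.
va_name_for_uuid() {
  grep "$1" | awk '{print $1}'
}

# Usage (UUID is a placeholder):
# kubectl get volumeattachments | va_name_for_uuid <csi-vol-uuid> \
#   | xargs -r kubectl delete volumeattachment
```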

[17115] Cluster update does not change releaseRefs in Cluster object for Ceph

Fixed in MOS 21.5

During an update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0, the releaseRefs field for Ceph in the Cluster object does not change.


Workaround:

  1. In the clusterworkloadlock CRD, remove the subresources section:

    kubectl edit crd clusterworkloadlocks.lcm.mirantis.com
    # remove here the 'subresources' section:
    #    - name: v1alpha1
    #      subresources:
    #        status: {}
  2. Obtain clusterRelease from the ceph-controller settings ConfigMap:

    kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.clusterRelease}'
  3. Create a ceph-cwl.yaml file with Ceph ClusterWorkloadLock:

    apiVersion: lcm.mirantis.com/v1alpha1
    kind: ClusterWorkloadLock
    metadata:
      name: ceph-clusterworkloadlock
    spec:
      controllerName: ceph
    status:
      state: inactive
      release: <clusterRelease> # from the previous step

    Substitute <clusterRelease> with the clusterRelease value obtained in the previous step.

  4. Apply the resource:

    kubectl apply -f ceph-cwl.yaml
  5. Verify that the lock has been created:

    kubectl get clusterworkloadlock ceph-clusterworkloadlock -o yaml
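Steps 2 through 4 can be combined into a small rendering helper. A minimal sketch: the render_ceph_cwl function name is illustrative, and the manifest layout (including the apiVersion) follows the example in step 3, which you should verify against your cluster.

```shell
#!/bin/sh
# Sketch: render the Ceph ClusterWorkloadLock manifest for a given clusterRelease value.
render_ceph_cwl() {
  cat <<EOF
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterWorkloadLock
metadata:
  name: ceph-clusterworkloadlock
spec:
  controllerName: ceph
status:
  state: inactive
  release: $1
EOF
}

# Usage:
# release=$(kubectl -n ceph-lcm-mirantis get cm ccsettings \
#   -o jsonpath='{.data.clusterRelease}')
# render_ceph_cwl "$release" > ceph-cwl.yaml
# kubectl apply -f ceph-cwl.yaml
```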

[17038] Cluster update may fail with TimeoutError

Affects only MOS 21.4

A MOS cluster update from the Cluster release 6.16.0 to 6.18.0 may fail with the Timeout waiting for pods statuses error. The error means that pod containers are not ready and often restart with OOMKilled as the restart reason. For example:

kubectl describe pod prometheus-server-0 -n stacklight
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 16 Aug 2021 12:47:57 +0400
      Finished:     Mon, 16 Aug 2021 12:58:02 +0400


Workaround:

  1. In the Cluster object, set clusterSize to medium as described in Mirantis Container Cloud Operations Guide: StackLight configuration parameters.

  2. Wait until the updated resource limits propagate to the prometheus-server StatefulSet object.

  3. Delete the affected prometheus-server pods. For example:

    kubectl delete pods prometheus-server-0 prometheus-server-1 -n stacklight

Once done, new pods with updated resource limits will be created automatically.
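Detecting affected pods can be scripted against the kubectl describe output shown above. The was_oomkilled helper below is an illustrative sketch, not part of the product tooling.

```shell
#!/bin/sh
# Sketch: check 'kubectl describe pod' output (read from stdin) for an OOMKilled last state.
was_oomkilled() {
  grep -Eq 'Reason:[[:space:]]+OOMKilled'
}

# Usage:
# kubectl describe pod prometheus-server-0 -n stacklight | was_oomkilled \
#   && echo "prometheus-server-0 was OOMKilled"
```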