Cluster update known issues¶
This section lists the cluster update known issues with workarounds for the Mirantis OpenStack for Kubernetes release 21.4.
[17477] StackLight in HA mode is not deployed or cluster update is blocked
[17305] Cluster update fails with the ‘Not ready releases: descheduler’ error
[16987] Cluster update fails at Ceph CSI pod eviction
[17115] Cluster update does not change releaseRefs in Cluster object for Ceph
[17038] Cluster update may fail with TimeoutError
[17477] StackLight in HA mode is not deployed or cluster update is blocked¶
Fixed in MOS 21.5
New managed clusters deployed using the Cluster release 6.18.0 with StackLight enabled in the HA mode on the control plane nodes do not get StackLight deployed. The update of existing clusters with such a StackLight configuration that were created using the Cluster release 6.16.0 is blocked with the following error message:
cluster release version upgrade is forbidden: \
Minimum number of worker machines with StackLight label is 3
Workaround:
On the affected managed cluster:
Create a key-value pair that will be used as a unique label on the cluster nodes. In our example, it is forcedRole: stacklight.
To verify the label names that already exist on the cluster nodes:
kubectl get nodes --show-labels
Add the new label to the target nodes for StackLight. For example, to the Kubernetes master nodes:
kubectl label nodes --selector=node-role.kubernetes.io/master forcedRole=stacklight
Verify that the new label is added:
kubectl get nodes --show-labels
On the related management cluster:
Configure nodeSelector for the StackLight components by modifying the affected Cluster object:
kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>
For example:
spec:
  ...
  providerSpec:
    ...
    value:
      ...
      helmReleases:
      ...
      - name: stacklight
        values:
          ...
          nodeSelector:
            default:
              forcedRole: stacklight
Select from the following options:
If you faced the issue during a managed cluster deployment, skip this step.
If you faced the issue during a managed cluster update, wait until all StackLight component resources are recreated on the target nodes with the updated node selectors.
To monitor the cluster status:
kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' | jq
In the cluster status, verify that the elasticsearch-master and prometheus-server resources are ready. The process can take up to 30 minutes.
Example of a negative system response:
{ "message": "not ready: statefulSets: stacklight/elasticsearch-master got 2/3 replicas", "ready": false, "type": "StackLight" }
In the Container Cloud web UI, add a fake StackLight label to any 3 worker nodes to satisfy the deployment requirement as described in Create a machine using MOSK management console. Eventually, StackLight will still be placed on the target nodes with the forcedRole: stacklight label.
Once done, the StackLight deployment or update proceeds.
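To confirm that the StackLight workloads have landed on the nodes carrying the forcedRole: stacklight label, you can inspect pod placement in the stacklight namespace, for example:
kubectl get pods -n stacklight -o wide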
[17305] Cluster update fails with the ‘Not ready releases: descheduler’ error¶
Affects only MOS 21.4
An update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0 may fail with an error message similar to the following:
Cluster data status: conditions:
- message: 'Helm charts are not installed(upgraded) yet. Not ready releases: descheduler.'
ready: false
type: Helm
The issue may affect the descheduler and metrics-server Helm releases.
As a workaround, run helm uninstall descheduler or helm uninstall metrics-server and wait for Helm Controller to recreate the affected release.
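For example, a minimal sequence for the descheduler release may look as follows. The release namespace is not specified in this known issue, so it is shown as a placeholder; verify it first with helm list:
helm list --all-namespaces | grep -E 'descheduler|metrics-server'
# Substitute <releaseNamespace> with the namespace reported by the previous command
helm uninstall descheduler -n <releaseNamespace>
Once Helm Controller recreates the release, the Helm condition in the cluster status should report ready again.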
[16987] Cluster update fails at Ceph CSI pod eviction¶
Fixed in MOS 22.2
An update of a MOS cluster may fail with the ceph csi-driver is not evacuated yet, waiting… error during the Ceph CSI pod eviction.
Workaround:
Scale the affected StatefulSet of the pod that fails to init down to 0 replicas (see the sketch after this procedure). If it is a DaemonSet such as nova-compute, it must not be scheduled on the affected node.
On every csi-rbdplugin pod, search for the stuck csi-vol:
rbd device list | grep <csi-vol-uuid>
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
Delete the volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale the affected StatefulSet back to the original number of replicas or until its state is Running. If it is a DaemonSet, run the pod on the affected node again.
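The scale operations mentioned above can be performed with kubectl scale. The namespace, StatefulSet name, and original replica count below are placeholders, substitute the values of the affected workload:
# Scale the affected StatefulSet down to 0 replicas
kubectl -n <namespace> scale statefulset <affectedStatefulSet> --replicas=0
# ... perform the rbd unmap and volumeattachment cleanup ...
# Scale the StatefulSet back to its original number of replicas
kubectl -n <namespace> scale statefulset <affectedStatefulSet> --replicas=<originalReplicasNumber>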
[17115] Cluster update does not change releaseRefs in Cluster object for Ceph¶
Fixed in MOS 21.5
During an update of a MOS cluster from the Cluster release 6.16.0 to 6.18.0,
the status.providerStatus.releaseRefs.previous.name field in the Cluster
object does not change.
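To inspect the current value of this field, you can query the Cluster object directly, using the same placeholders as in the examples above:
kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.releaseRefs.previous.name}'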
Workaround:
In the clusterworkloadlock CRD, remove the subresources section:
kubectl edit crd clusterworkloadlocks.lcm.mirantis.com

# remove here the 'subresources' section:
spec:
  versions:
  - name: v1alpha1
    subresources:
      status: {}
Obtain clusterRelease from the ceph-controller settings ConfigMap:
kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.clusterRelease}'
Create a ceph-cwl.yaml file with the Ceph ClusterWorkloadLock:
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterWorkloadLock
metadata:
  name: ceph-clusterworkloadlock
spec:
  controllerName: ceph
status:
  state: inactive
  release: <clusterRelease> # from the previous step
Substitute <clusterRelease> with the clusterRelease value obtained in the previous step.
Apply the resource:
kubectl apply -f ceph-cwl.yaml
Verify that the lock has been created:
kubectl get clusterworkloadlock ceph-clusterworkloadlock -o yaml
[17038] Cluster update may fail with TimeoutError¶
Affects only MOS 21.4
A MOS cluster update from the Cluster release 6.16.0 to 6.18.0 may fail with the Timeout waiting for pods statuses error. The error means that pod containers are not ready and often restart with OOMKilled as the restart reason. For example:
kubectl describe pod prometheus-server-0 -n stacklight
...
Containers:
...
prometheus-server:
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 16 Aug 2021 12:47:57 +0400
Finished: Mon, 16 Aug 2021 12:58:02 +0400
...
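To identify the affected pods, you can check restart counts and statuses in the stacklight namespace, for example:
kubectl get pods -n stacklight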
Workaround:
In the Cluster object, set clusterSize to medium as described in Operations Guide: StackLight configuration parameters.
Wait until the updated resource limits propagate to the prometheus-server StatefulSet object (see the sketch after these steps).
Delete the affected prometheus-server pods. For example:
kubectl delete pods prometheus-server-0 prometheus-server-1 -n stacklight
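To check that the new limits have propagated, you can inspect the container resources of the prometheus-server StatefulSet, for example:
kubectl -n stacklight get statefulset prometheus-server -o jsonpath='{.spec.template.spec.containers[*].resources}'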
Once done, new pods with updated resource limits will be created automatically.