Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.18.0, including the Cluster releases 11.2.0 and 7.8.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see the Troubleshooting Guide.
Note
This section also outlines known issues from previous releases that are still valid.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready > terminating > pending) and with the
following error message appearing in logs:
'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
Disable Docker Content Trust (DCT):
Access the MKE web UI as admin.
Navigate to Admin > Admin Settings.
In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments, such as calico-kube-controllers, compose, compose-api, coredns, and so on:
kubectl -n kube-system delete deployment <deploymentName>
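For example, a minimal shell loop over the deployments named in this step (the list is illustrative; extend it with any other flapping deployments in your cluster):
# Delete each affected kube-system deployment so that it is recreated;
# per this workaround, deletion acts as a restart for the MKE system deployments
for deployment in calico-kube-controllers compose compose-api coredns; do
  kubectl -n kube-system delete deployment "$deployment"
done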
Once done, the cluster deployment or update resumes.
Re-enable DCT.
Bare metal¶
[24806] The dnsmasq parameters are not applied on multi-rack clusters¶
During bootstrap of a bare metal management cluster with a multi-rack topology,
the dhcp-option=tag parameters are not applied to dnsmasq.conf.
Symptoms:
The dnsmasq-controller service logs contain error messages similar to the following example:
KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas logs --tail 50 deployment/dnsmasq -c dnsmasq-controller
...
I0622 09:05:26.898898 8 handler.go:19] Failed to watch Object, kind:'dnsmasq': failed to list *unstructured.Unstructured: the server could not find the requested resource
E0622 09:05:26.899108 8 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.8/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: the server could not find the requested resource
...
Workaround:
Manually update deployment/dnsmasq with the updated image:
KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas set image deployment/dnsmasq dnsmasq-controller=mirantis.azurecr.io/bm/dnsmasq-controller:base-focal-2-18-issue24806-20220618085127
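Optionally, wait for the rollout to complete before retrying the bootstrap; this is a standard kubectl check rather than part of the documented fix:
# Wait until the dnsmasq deployment is running the updated image
KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas rollout status deployment/dnsmasq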
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
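A minimal sketch of this pre-deletion step; the drain flag shown is a common requirement for nodes running DaemonSet pods and is an assumption, not mandated by the workaround above:
# Prevent new pods from being scheduled on the node
kubectl cordon <nodeName>

# Evict the running pods; --ignore-daemonsets is typically required because
# DaemonSet pods (for example, CNI agents) cannot be evicted
kubectl drain <nodeName> --ignore-daemonsets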
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
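For example, a minimal shell sketch that runs the command above for each of the listed kinds (the kind list mirrors the one above and may not be exhaustive for your deployment):
# Delete every object of the listed kinds that carries the region label
for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete "$kind" -l kaas.mirantis.com/region=<regionName>
done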
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
Equinix Metal¶
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the
FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the description of the pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
<affectedProjectName> is the Container Cloud project name where the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.
On every csi-rbdplugin pod, search for the stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete the volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
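The scale-down and scale-up steps above can be performed with standard kubectl commands, for example (the workload type, name, and replica count are placeholders, not taken from the source):
# Scale the workload that fails to init down to 0 replicas
kubectl -n <affectedProjectName> scale --replicas 0 statefulset <workloadName>
# or, for a Deployment:
kubectl -n <affectedProjectName> scale --replicas 0 deployment <workloadName>

# After cleaning up the stale volumeattachment, restore the original replica count
kubectl -n <affectedProjectName> scale --replicas <originalReplicas> statefulset <workloadName>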
StackLight¶
[27732-1] OpenSearch PVC size custom settings are dismissed during deployment¶
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is
overwritten by logging.persistentVolumeClaimSize during deployment of a
Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if the default retention time is set. The default setting is usually enough for the capacity size of this cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle brackets to match the affected cluster values.
kubectl --kubeconfig=<managementClusterKubeconfigPath> \
-n <affectedClusterProjectName> \
get cluster <affectedClusterName> \
-o=jsonpath='{.spec.providerSpec.value.helmReleases[*].values.elasticsearch.persistentVolumeClaimSize}' | xargs echo config size:
kubectl --kubeconfig=<affectedClusterKubeconfigPath> \
-n stacklight get pvc -l 'app=opensearch-master' \
-o=jsonpath="{.items[*].status.capacity.storage}" | xargs echo capacity sizes:
The cluster is not affected if the configuration size value matches or is less than any capacity size. For example:
config size: 30Gi
capacity sizes: 30Gi 30Gi 30Gi

config size: 50Gi
capacity sizes: 100Gi 100Gi 100Gi
The cluster is affected if the configuration size is larger than any capacity size. For example:
config size: 200Gi
capacity sizes: 100Gi 100Gi 100Gi
Workaround for a new cluster creation:
Select from the following options:
For a management or regional cluster, during the bootstrap procedure, open cluster.yaml.template for editing.
For a managed cluster, open the Cluster object for editing.
Caution
For a managed cluster, use the Container Cloud API instead of the web UI for cluster creation.
In the opened .yaml file, add logging.persistentVolumeClaimSize along with elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Continue the cluster deployment. The system will use the custom value set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in the .yaml file, the Admission Controller blocks the configuration update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage of OpenSearch and its dependent components may occur with the following alerts firing on the cluster. This behavior is expected. Therefore, disregard these alerts.
StackLight alerts list firing during cluster update
Cluster size and outage probability level | Alert name | Label name and component
---|---|---
Any cluster with high probability | … | …
Large cluster with average probability | … | …
Any cluster with low probability | … | …
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be lost. Therefore, if required, migrate log data to a new persistent volume (PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Delete existing PVCs:
kubectl delete pvc -l 'app=opensearch-master' -n stacklight
Warning
This command removes all existing log data from the PVCs.
In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 3 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
StackLight in non-HA mode with an expandable StorageClass for
OpenSearch PVCs
Note
To verify whether a StorageClass is expandable:
kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'
A positive system response is allowVolumeExpansion: true. A negative
system response is blank or false.
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl -n stacklight get statefulset opensearch-master -o yaml | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Patch the PVCs with the new elasticsearch.persistentVolumeClaimSize value:
kubectl -n stacklight patch pvc opensearch-master-opensearch-master-0 -p '{ "spec": { "resources": { "requests": { "storage": "<pvcSize>" }}}}'
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
StackLight in non-HA mode with a non-expandable StorageClass
and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be lost. Depending on your custom provisioner, a third-party tool, such as pv-migrate, may allow you to copy all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
Note
To verify whether a StorageClass is expandable:
kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'
A positive system response is allowVolumeExpansion: true. A negative
system response is blank or false.
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Delete existing PVCs:
kubectl delete pvc -l 'app=opensearch-master' -n stacklight
Warning
This command removes all existing log data from the PVCs.
In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as the elasticsearch.persistentVolumeClaimSize parameter. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
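After completing any of the three options above, you can re-run the capacity check from the verification section to confirm that the PVC sizes now match the configured value:
# Capacity sizes should now equal elasticsearch.persistentVolumeClaimSize (<pvcSize>)
kubectl -n stacklight get pvc -l 'app=opensearch-master' \
  -o=jsonpath="{.items[*].status.capacity.storage}" | xargs echo capacity sizes: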
[27732-2] Custom settings for ‘elasticsearch.logstashRetentionTime’ are dismissed¶
Custom settings for the deprecated elasticsearch.logstashRetentionTime
parameter are overwritten by the default setting set to 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace
elasticsearch.logstashRetentionTime with elasticsearch.retentionTime
that was implemented to replace the deprecated parameter. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            retentionTime:
              logstash: 10
              events: 10
              notifications: 10
          logging:
            enabled: true
For the StackLight configuration procedure and parameters description, refer to StackLight configuration procedure.
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created and get stuck during pod migration.
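To list candidate pods before removing them, a hedged check (such stuck pods usually end up in the Failed phase; verify the Pod predicate NodeAffinity failed message in the pod status before deleting anything):
# List StackLight pods in the Failed phase; inspect their status for the
# "Pod predicate NodeAffinity failed" message before deleting them
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods \
  --field-selector=status.phase=Failed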
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
Upgrade¶
[24802] Container Cloud upgrade to 2.18.0 can trigger managed clusters update¶
Affects only Container Cloud 2.18.0
On clusters with proxy enabled and with the NO_PROXY settings containing
localhost/127.0.0.1 or matching the automatically added Container Cloud
internal endpoints, the Container Cloud release upgrade from 2.17.0 to 2.18.0
triggers automatic update of managed clusters to the latest available Cluster
releases in their respective series.
For the issue workaround, contact Mirantis support.
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io

STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
Workaround:
Warning
Apply the steps below to the affected nodes one-by-one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. Such a sequence ensures that each node is cordon-drained
and Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
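Optionally, confirm the downgrade before waiting for Ansible, reusing the version check from the verification steps above:
# Should now report containerd 1.5.8 instead of 1.5.9
containerd --version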
Wait for Ansible to reconcile. The node should become Ready in several minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.
Patch the remaining nodes one-by-one using the steps above.
Container Cloud web UI¶
[23002] Inability to set a custom value for a predefined node label¶
Fixed in 7.11.0, 11.5.0 and 12.5.0
During machine creation using the Container Cloud web UI, a custom value for a node label cannot be set.
As a workaround, manually add the value to
spec.providerSpec.value.nodeLabels in machine.yaml.
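A minimal sketch of the edit, assuming that nodeLabels is a list of key/value pairs as in other Machine objects of the cluster; the label key and value below are placeholders only, so copy the exact structure from an existing Machine object:
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
spec:
  providerSpec:
    value:
      # Illustrative custom label; use an allowed key and your custom value
      nodeLabels:
      - key: <labelKey>
        value: <customValue>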
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, relogin to the Container Cloud web UI.