Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.21.0, including the Cluster releases 11.5.0 and 7.11.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines known issues from previous Container Cloud releases that are still valid.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose Pods flapping (ready > terminating > pending) and with the following error message appearing in logs:
'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
got 0/0 replicas'
ready: false
type: Kubernetes
Workaround:
Disable Docker Content Trust (DCT):
Access the MKE web UI as admin.
Navigate to Admin > Admin Settings.
In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments, such as calico-kube-controllers, compose, compose-api, coredns, and so on (see the loop example after this procedure):
kubectl -n kube-system delete deployment <deploymentName>
Once done, the cluster deployment or update resumes.
Re-enable DCT.
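As referenced in the restart step above, if several deployments are affected, you can delete them in one pass while DCT is still disabled. A minimal sketch, assuming the deployment names listed above; adjust the list to match your cluster:
for d in calico-kube-controllers compose compose-api coredns; do
  kubectl -n kube-system delete deployment "$d"
done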
Bare metal¶
[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods¶
Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with the mcc-cache Pods being stuck in the CrashLoopBackOff state with numerous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
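To restart all mcc-cache Pods at once, you can delete them in one command. A minimal sketch, assuming the Pods follow the mcc-cache-<n> naming shown in the example above:
kubectl -n kaas get pods -o name | grep mcc-cache | xargs kubectl -n kaas delete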
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
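For example (the --ignore-daemonsets and --delete-emptydir-data flags are common additions for draining nodes that run DaemonSet Pods and Pods with emptyDir volumes; verify that they are appropriate for your cluster before using them):
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data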
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:
cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
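For convenience, you can iterate over the object types listed in the workaround above. A minimal sketch, assuming the resources are visible from the current kubectl context (add namespace flags such as --all-namespaces if required for your setup):
for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete "$kind" -l kaas.mirantis.com/region=<regionName>
done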
Equinix Metal with private networking¶
[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods¶
Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with the mcc-cache Pods being stuck in the CrashLoopBackOff state with numerous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
vSphere¶
[26070] RHEL system cannot be registered in Red Hat portal over MITM proxy¶
Deployment of RHEL machines using the Red Hat portal registration, which requires user and password credentials, over MITM proxy fails while building the virtual machines template with the following error:
Unable to verify server's identity: [SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed (_ssl.c:618)
The Container Cloud deployment gets stuck while applying the RHEL license
to machines with the same error in the lcm-agent
logs.
As a workaround, use the internal Red Hat Satellite server that a VM can access directly without a MITM proxy.
LCM¶
[5782] Manager machine fails to be deployed during node replacement¶
Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)
During replacement of a manager machine, the following problems may occur:
The system adds the node to Docker swarm but not to Kubernetes
The node Deployment gets stuck with failed RethinkDB health checks
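To check for the first symptom, you can compare Docker Swarm membership with the Kubernetes node list. A sketch, assuming SSH access to an MKE manager node for the docker command:
docker node ls     # run on an MKE manager node
kubectl get nodes  # run against the affected cluster kubeconfig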
Workaround:
Delete the failed node.
Wait for the MKE cluster to become healthy. To monitor the cluster status:
Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.
Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
Deploy a new node.
[5568] The calico-kube-controllers Pod fails to clean up resources¶
Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)
During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig
fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin
kubeconfig
instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig
again.
To obtain the admin kubeconfig:
kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml
If the related cluster is regional, replace <pathToMgmtKubeconfig> with <pathToRegionalKubeconfig>.
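You can then confirm that the extracted admin kubeconfig works against the affected cluster, for example:
kubectl --kubeconfig clusterKubeconfig.yaml get nodes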
[27192] Failure to accept new connections by ‘portforward-controller’¶
During bootstrap of a management or regional cluster of any type, portforward-controller stops accepting new connections after receiving the Accept error: "EOF" error. Hence, nothing is copied between clients.
The workaround below applies only if machines are stuck in the Provision
state. Otherwise, contact Mirantis support to further assess the issue.
Workaround:
Verify that machines are stuck in the Provision state for 20 minutes or more. For example:
kubectl --kubeconfig <kindKubeconfigPath> get machines -o wide
Verify whether the portforward-controller Pod logs contain Accept error: "EOF" and Stopped forwarding:
kubectl --kubeconfig <kindKubeconfigPath> -n kaas logs -l app.kubernetes.io/name=portforward-controller | grep 'Accept error: "EOF"'
kubectl --kubeconfig <kindKubeconfigPath> -n kaas logs -l app.kubernetes.io/name=portforward-controller | grep 'Stopped forwarding'
Select from the following options:
If the errors mentioned in the previous step are present:
Restart the portforward-controller Deployment:
kubectl --kubeconfig <kindKubeconfigPath> -n kaas rollout restart deploy portforward-controller
Monitor the states of machines and the portforward-controller Pod logs. If the errors recur, restart the portforward-controller Deployment again.
If the errors mentioned in the previous step are not present, contact Mirantis support to further assess the issue.
StackLight¶
[29329] Recreation of the Patroni container replica is stuck¶
During an update of a Container Cloud cluster of any type, recreation of the
Patroni container replica is stuck in the degraded state due to the liveness
probe killing the container that runs the pg_rewind
procedure. The issue
affects clusters on which the pg_rewind
procedure takes more time than the
full cycle of the liveness probe.
The sample logs of the affected cluster:
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO: stdout=
INFO: stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0 LOG: database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0 LOG: could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0 LOG: invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0 PANIC: could not locate a valid checkpoint record
Workaround:
For the affected replica and PVC, run:
kubectl delete persistentvolumeclaim/storage-volume-patroni-<replica-id> -n stacklight
kubectl delete pod/patroni-<replica-id> -n stacklight
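To identify the affected replica before deleting its PVC and Pod, you can list the Patroni Pods and check which one is not fully ready and logs the crash-recovery messages shown above. A sketch, assuming the app=patroni label used elsewhere in this section:
kubectl -n stacklight get pods -l app=patroni
kubectl -n stacklight logs patroni-<replica-id> | grep -i 'crash recovery'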
[28526] CPU throttling for ‘kaas-exporter’ blocking metric collection¶
A low CPU limit of 100m for kaas-exporter blocks metric collection.
As a workaround, increase the CPU limit for kaas-exporter to 500m on the management cluster in the spec:providerSpec:value:kaas:management:helmReleases: section as described in Limits for management cluster components.
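A minimal sketch of the corresponding Cluster object change, assuming the kaas-exporter Helm release accepts a resources.limits override under values (verify the exact structure in Limits for management cluster components):
spec:
  providerSpec:
    value:
      kaas:
        management:
          helmReleases:
          - name: kaas-exporter
            values:
              resources:
                limits:
                  cpu: 500m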
[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM¶
On the baremetal-based management clusters, the restart count of the metric-collector Pod increases over time with reason: OOMKilled in the containerStatuses of the metric-collector Pod. Only clusters with HTTP proxy enabled are affected.
Such behavior is expected. Therefore, disregard these restarts.
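To confirm that the restarts are caused by OOM and not by another failure, you can inspect the last termination reason of the metric-collector containers, for example:
kubectl -n stacklight get pods | grep metric-collector
kubectl -n stacklight get pod <metricCollectorPodName> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'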
[28134] Failure to update a cluster with nodes in the ‘Prepare’ state¶
A Container Cloud cluster of any type fails to update with nodes being
stuck in the Prepare
state and the following example error in
Conditions
of the affected machine:
Error: error when evicting pods/"patroni-13-2" -n "stacklight": global timeout reached: 10m0s
Other symptoms of the issue are as follows:
One of the Patroni Pods has 2/3 of containers ready. For example:
kubectl get po -n stacklight -l app=patroni

NAME          READY   STATUS    RESTARTS   AGE
patroni-13-0  3/3     Running   0          32h
patroni-13-1  3/3     Running   0          38h
patroni-13-2  2/3     Running   0          38h
The patroni-patroni-exporter container from the affected Pod is not ready. For example:
kubectl get pod/patroni-13-2 -n stacklight -o jsonpath='{.status.containerStatuses[?(@.name=="patroni-patroni-exporter")].ready}'

false
As a workaround, restart the patroni-patroni-exporter
container of
the affected Patroni Pod:
kubectl exec <affectedPatroniPodName> -n stacklight -c patroni-patroni-exporter -- kill 1
For example:
kubectl exec patroni-13-2 -n stacklight -c patroni-patroni-exporter -- kill 1
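After restarting the container, you can re-run the readiness check shown above to confirm that the container now reports true:
kubectl get pod/patroni-13-2 -n stacklight -o jsonpath='{.status.containerStatuses[?(@.name=="patroni-patroni-exporter")].ready}'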
[27732-1] OpenSearch PVC size custom settings are dismissed during deployment¶
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is overwritten by logging.persistentVolumeClaimSize during deployment of a Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if the default retention time is set. The default setting is usually enough for the capacity size of this cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle brackets to match the affected cluster values.
kubectl --kubeconfig=<managementClusterKubeconfigPath> \
-n <affectedClusterProjectName> \
get cluster <affectedClusterName> \
-o=jsonpath='{.spec.providerSpec.value.helmReleases[*].values.elasticsearch.persistentVolumeClaimSize}' | xargs echo config size:
kubectl --kubeconfig=<affectedClusterKubeconfigPath> \
-n stacklight get pvc -l 'app=opensearch-master' \
-o=jsonpath="{.items[*].status.capacity.storage}" | xargs echo capacity sizes:
The cluster is not affected if the configuration size value matches or is less than any capacity size. For example:
config size: 30Gi
capacity sizes: 30Gi 30Gi 30Gi

config size: 50Gi
capacity sizes: 100Gi 100Gi 100Gi
The cluster is affected if the configuration size is larger than any capacity size. For example:
config size: 200Gi
capacity sizes: 100Gi 100Gi 100Gi
Workaround for a new cluster creation:
Select from the following options:
For a management or regional cluster, during the bootstrap procedure, open cluster.yaml.template for editing.
For a managed cluster, open the Cluster object for editing.
Caution
For a managed cluster, use the Container Cloud API instead of the web UI for cluster creation.
In the opened .yaml file, add logging.persistentVolumeClaimSize along with elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Continue the cluster deployment. The system will use the custom value set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in the .yaml file, the Admission Controller blocks the configuration update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage of OpenSearch and its dependent components may occur with the following alerts firing on the cluster. This behavior is expected. Therefore, disregard these alerts.
StackLight alerts list firing during cluster update
The alerts are grouped by cluster size and outage probability level: any cluster with high probability, large cluster with average probability, and any cluster with low probability.
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be lost. Therefore, if required, migrate log data to a new persistent volume (PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Delete existing PVCs:
kubectl delete pvc -l 'app=opensearch-master' -n stacklight
Warning
This command removes all existing log data from PVCs.
In the Cluster configuration, set the same logging.persistentVolumeClaimSize as the size of elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 3 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
StackLight in non-HA mode with an expandable StorageClass for OpenSearch PVCs
Note
To verify whether a StorageClass is expandable:
kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'
A positive system response is allowVolumeExpansion: true. A negative system response is blank or false.
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl -n stacklight get statefulset opensearch-master -o yaml | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Patch the PVCs with the new elasticsearch.persistentVolumeClaimSize value:
kubectl -n stacklight patch pvc opensearch-master-opensearch-master-0 -p '{ "spec": { "resources": { "requests": { "storage": "<pvcSize>" }}}}'
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
In the Cluster configuration, set logging.persistentVolumeClaimSize the same as the size of elasticsearch.persistentVolumeClaimSize. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
StackLight in non-HA mode with a non-expandable StorageClass and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be lost. Depending on your custom provisioner, you may find a third-party tool, such as pv-migrate, that allows copying all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
Note
To verify whether a StorageClass is expandable:
kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'
A positive system response is allowVolumeExpansion: true. A negative system response is blank or false.
Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 0 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
Recreate the opensearch-master StatefulSet with the updated disk size:
kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
kubectl -n stacklight delete statefulset opensearch-master
kubectl create -f opensearch-master.yaml
Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.
Delete existing PVCs:
kubectl delete pvc -l 'app=opensearch-master' -n stacklight
Warning
This command removes all existing log data from PVCs.
In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as the elasticsearch.persistentVolumeClaimSize parameter. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi
Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:
kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
sleep 100
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
kubectl -n stacklight scale --replicas 1 deployment metricbeat
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
[27732-2] Custom settings for ‘elasticsearch.logstashRetentionTime’ are dismissed¶
Custom settings for the deprecated elasticsearch.logstashRetentionTime parameter are overwritten by the default setting of 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime
:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace elasticsearch.logstashRetentionTime with elasticsearch.retentionTime that was implemented to replace the deprecated parameter. For example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
      ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            retentionTime:
              logstash: 10
              events: 10
              notifications: 10
          logging:
            enabled: true
For the StackLight configuration procedure and parameters description, refer to Configure StackLight.
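To check which retention parameters are currently set for StackLight in the Cluster object, you can inspect the Helm release values, for example (the jsonpath filter below is an illustration):
kubectl --kubeconfig <managementClusterKubeconfigPath> -n <affectedClusterProjectName> \
  get cluster <affectedClusterName> \
  -o=jsonpath='{.spec.providerSpec.value.helmReleases[?(@.name=="stacklight")].values.elasticsearch}'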
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight Pods may get stuck with the Pod predicate NodeAffinity failed error in the Pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.
The issue does not affect the StackLight services: all required StackLight Pods migrate successfully, except for extra Pods that are created and get stuck during Pod migration.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
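To find the stuck Pods, you can filter the Pod list by the reported status. A sketch, assuming that the stuck Pods show NodeAffinity in the STATUS column of the kubectl output:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods | grep -i nodeaffinity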
Storage¶
[28783] Ceph condition stuck in absence of the Ceph cluster secrets info¶
The Ceph condition gets stuck in the absence of the Ceph cluster secrets information. The observed behavior can be found on the MOSK 22.3 clusters running on top of Container Cloud 2.21.
The list of symptoms includes:
The Cluster object contains the following condition:
Failed to configure Ceph cluster: ceph cluster status info is not updated at least for 5 minutes, ceph cluster secrets info is not available yet
The ceph-kcc-controller logs from the kaas namespace contain the following log lines:
2022-11-30 19:39:17.393595 E | ceph-spec: failed to update cluster condition to \
{Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully \
LastHeartbeatTime:2022-11-30 19:39:17.378401993 +0000 UTC m=+2617.717554955 \
LastTransitionTime:2022-05-16 16:14:37 +0000 UTC}. failed to update object \
"rook-ceph/rook-ceph" status: Operation cannot be fulfilled on \
cephclusters.ceph.rook.io "rook-ceph": the object has been modified; please \
apply your changes to the latest version and try again
Workaround:
Edit KaaSCephCluster of the affected managed cluster:
kubectl -n <managedClusterProject> edit kaascephcluster
Substitute <managedClusterProject> with the corresponding managed cluster namespace.
Define the version parameter in the KaaSCephCluster spec:
spec:
  cephClusterSpec:
    version: 15.2.13
Note
Starting from MOSK 22.4, the Ceph cluster version updates to 15.2.17. Therefore, remove the version parameter definition from KaaSCephCluster after the managed cluster update.
Save the updated KaaSCephCluster spec.
Find the MiraCeph Custom Resource on a managed cluster and copy all annotations starting with meta.helm.sh:
kubectl --kubeconfig <managedClusterKubeconfig> get crd miracephs.lcm.mirantis.com -o yaml
Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.
Example of a system output:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.6.0
    # save all annotations with "meta.helm.sh" somewhere
    meta.helm.sh/release-name: ceph-controller
    meta.helm.sh/release-namespace: ceph
...
Create the miracephsecretscrd.yaml file and fill it with the following template:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.6.0
    <insert all "meta.helm.sh" annotations here>
  labels:
    app.kubernetes.io/managed-by: Helm
  name: miracephsecrets.lcm.mirantis.com
spec:
  conversion:
    strategy: None
  group: lcm.mirantis.com
  names:
    kind: MiraCephSecret
    listKind: MiraCephSecretList
    plural: miracephsecrets
    singular: miracephsecret
  scope: Namespaced
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: MiraCephSecret aggregates secrets created by Ceph
        properties:
          apiVersion:
            type: string
          kind:
            type: string
          metadata:
            type: object
          status:
            properties:
              lastSecretCheck:
                type: string
              lastSecretUpdate:
                type: string
              messages:
                items:
                  type: string
                type: array
              state:
                type: string
            type: object
        type: object
    served: true
    storage: true
Insert the copied meta.helm.sh annotations into the metadata.annotations section of the template.
Apply miracephsecretscrd.yaml on the managed cluster:
kubectl --kubeconfig <managedClusterKubeconfig> apply -f miracephsecretscrd.yaml
Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.
Obtain the MiraCeph name from the managed cluster:
kubectl --kubeconfig <managedClusterKubeconfig> -n ceph-lcm-mirantis get miraceph -o name
Substitute <managedClusterKubeconfig> with the corresponding managed cluster kubeconfig.
Example of a system output:
miraceph.lcm.mirantis.com/rook-ceph
Copy the MiraCeph name after the slash, the rook-ceph part from the example above.
Create the mcs.yaml file and fill it with the following template:
apiVersion: lcm.mirantis.com/v1alpha1
kind: MiraCephSecret
metadata:
  name: <miracephName>
  namespace: ceph-lcm-mirantis
status: {}
Substitute <miracephName> with the MiraCeph name from the previous step.
Apply mcs.yaml on the managed cluster:
kubectl --kubeconfig <managedClusterKubeconfig> apply -f mcs.yaml
Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.
After some delay, the cluster condition will be updated to the healthy state.
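To confirm that the condition has cleared, you can re-check the Cluster object conditions on the management cluster, for example (the jsonpath below assumes that provider conditions are exposed under status.providerStatus):
kubectl -n <managedClusterProject> get cluster <managedClusterName> \
  -o jsonpath='{.status.providerStatus.conditions}'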
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.
Workaround:
Verify that the description of the Pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas.
On every csi-rbdplugin Pod, search for stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete volumeattachment of the affected Pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
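Once scaled up, you can verify that the previously stuck PersistentVolumeClaim is Bound and that the prometheus-server Pod is Running, for example:
kubectl -n stacklight get pvc | grep prometheus-server
kubectl -n stacklight get pods | grep prometheus-server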