Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.21.0 including the Cluster releases 11.5.0 and 7.11.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.


MKE

[20651] A cluster deployment or update fails with not ready compose deployments

A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
 got 0/0 replicas'
 ready: false
 type: Kubernetes

Workaround:

  1. Disable Docker Content Trust (DCT):

    1. Access the MKE web UI as admin.

    2. Navigate to Admin > Admin Settings.

    3. In the left navigation pane, click Docker Content Trust and disable it.

  2. Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:

    kubectl -n kube-system delete deployment <deploymentName>
    

    Once done, the cluster deployment or update resumes.

  3. Re-enable DCT.



Bare metal

[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods

Fixed in 11.6.0

Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with the mcc-cache Pods being stuck in the CrashLoopBackOff state and constantly restarting.

As a workaround, remove failed mcc-cache Pods to restart them automatically. For example:

kubectl -n kaas delete pod mcc-cache-0

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
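
For example:

kubectl cordon <nodeName>
kubectl drain <nodeName>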

[20736] Region deletion failure after regional deployment failure

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

  • cluster

  • machine

  • baremetalhost

  • baremetalhostprofile

  • l2template

  • subnet

  • ipamhost

  • ipaddr

kubectl delete <objectType> -l kaas.mirantis.com/region=<regionName>
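
For example, to review the leftover objects of one of the types listed above before deleting them (a sketch; <objectType> and <regionName> are placeholders):

kubectl get <objectType> -A -l kaas.mirantis.com/region=<regionName>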

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.



Equinix Metal with private networking

[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods

Fixed in 11.6.0

Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with the mcc-cache Pods being stuck in the CrashLoopBackOff state and constantly restarting.

As a workaround, remove failed mcc-cache Pods to restart them automatically. For example:

kubectl -n kaas delete pod mcc-cache-0

vSphere

[26070] RHEL system cannot be registered in Red Hat portal over MITM proxy

Deployment of RHEL machines using the Red Hat portal registration, which requires user and password credentials, over MITM proxy fails while building the virtual machine template with the following error:

Unable to verify server's identity: [SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed (_ssl.c:618)

The Container Cloud deployment gets stuck while applying the RHEL license to machines with the same error in the lcm-agent logs.

As a workaround, use the internal Red Hat Satellite server that a VM can access directly without a MITM proxy.


LCM

[5782] Manager machine fails to be deployed during node replacement

Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)

During replacement of a manager machine, the following problems may occur:

  • The system adds the node to Docker swarm but not to Kubernetes

  • The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

  1. Delete the failed node.

  2. Wait for the MKE cluster to become healthy. To monitor the cluster status:

    1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.

    2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.

  3. Deploy a new node.

[5568] The calico-kube-controllers Pod fails to clean up resources

Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

  • The calico-kube-controllers Pod fails to clean up resources associated with the deleted node

  • The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>

[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update

During update of a Container Cloud cluster of any type, if the MKE minor version is updated from 3.4.x to 3.5.x, access to the cluster using the existing kubeconfig fails with the You must be logged in to the server (Unauthorized) error due to OIDC settings being reconfigured.

As a workaround, during the cluster update process, use the admin kubeconfig instead of the existing one. Once the update completes, you can use the existing cluster kubeconfig again.

To obtain the admin kubeconfig:

kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml

If the related cluster is regional, replace <pathToMgmtKubeconfig> with <pathToRegionalKubeconfig>.
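
To verify access using the obtained kubeconfig, for example:

kubectl --kubeconfig clusterKubeconfig.yaml get nodes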

[27192] Failure to accept new connections by ‘portforward-controller’

Fixed in 11.6.0 and 12.7.0

During bootstrap of a management or regional cluster of any type, portforward-controller stops accepting new connections after receiving the Accept error: “EOF” error. Hence, nothing is copied between clients.

The workaround below applies only if machines are stuck in the Provision state. Otherwise, contact Mirantis support to further assess the issue.

Workaround:

  1. Verify that machines have been stuck in the Provision state for 20 minutes or more. For example:

    kubectl --kubeconfig <kindKubeconfigPath> get machines -o wide
    
  2. Verify whether the portforward-controller Pod logs contain the Accept error: "EOF" and Stopped forwarding messages:

    kubectl --kubeconfig <kindKubeconfigPath> -n kaas logs -lapp.kubernetes.io/name=portforward-controller | grep 'Accept error: "EOF"'
    
    kubectl --kubeconfig <kindKubeconfigPath> -n kaas logs -lapp.kubernetes.io/name=portforward-controller | grep 'Stopped forwarding'
    
  3. Select from the following options:

    • If the errors mentioned in the previous step are present:

      1. Restart the portforward-controller Deployment:

        kubectl --kubeconfig <kindKubeconfigPath> -n kaas rollout restart deploy portforward-controller
        
      2. Monitor the states of machines and the portforward-controller Pod logs. If the errors recur, restart the portforward-controller Deployment again.

    • If the errors mentioned in the previous step are not present, contact Mirantis support to further assess the issue.


StackLight

[29329] Recreation of the Patroni container replica is stuck

Fixed in 11.7.0 and 12.7.0

During an update of a Container Cloud cluster of any type, recreation of the Patroni container replica is stuck in the degraded state due to the liveness probe killing the container that runs the pg_rewind procedure. The issue affects clusters on which the pg_rewind procedure takes more time than the full cycle of the liveness probe.

The sample logs of the affected cluster:

INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO:  stdout=
INFO:  stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0     LOG:  database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0  LOG:  could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0  LOG:  invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0  PANIC:  could not locate a valid checkpoint record

Workaround:

For the affected replica and PVC, run:

kubectl delete persistentvolumeclaim/storage-volume-patroni-<replica-id> -n stacklight

kubectl delete pod/patroni-<replica-id> -n stacklight
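
To monitor the recreation of the replica, for example:

kubectl -n stacklight get pod patroni-<replica-id> -w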

[28526] CPU throttling for ‘kaas-exporter’ blocking metric collection

Fixed in 11.6.0 and 12.7.0

A low CPU limit of 100m for kaas-exporter blocks metric collection.

As a workaround, increase the CPU limit for kaas-exporter to 500m on the management cluster in the spec:providerSpec:value:kaas:management:helmReleases: section as described in Limits for management cluster components.
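
The following snippet is a minimal sketch of the corresponding Cluster object configuration, assuming that the kaas-exporter Helm release values accept the standard resources:limits format described in Limits for management cluster components:

spec:
  providerSpec:
    value:
      kaas:
        management:
          helmReleases:
          - name: kaas-exporter
            values:
              resources:
                limits:
                  cpu: 500m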

[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM

Fixed in 11.7.0 and 12.7.0

On baremetal-based management clusters, the restart count of the metric-collector Pod increases over time with reason: OOMKilled in the containerStatuses of the metric-collector Pod. Only clusters with HTTP proxy enabled are affected.

Such behavior is expected. Therefore, disregard these restarts.
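
To verify the restart reason of the affected Pod, for example (the Pod name is a placeholder):

kubectl -n stacklight get pod <metricCollectorPodName> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'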

[28134] Failure to update a cluster with nodes in the ‘Prepare’ state

Fixed in 11.6.0 and 12.7.0

A Container Cloud cluster of any type fails to update with nodes being stuck in the Prepare state and the following example error in Conditions of the affected machine:

Error: error when evicting pods/"patroni-13-2" -n "stacklight": global timeout reached: 10m0s

Other symptoms of the issue are as follows:

  • One of the Patroni Pods has 2/3 of containers ready. For example:

    kubectl get po -n stacklight -l app=patroni
    
    NAME           READY   STATUS    RESTARTS   AGE
    patroni-13-0   3/3     Running   0          32h
    patroni-13-1   3/3     Running   0          38h
    patroni-13-2   2/3     Running   0          38h
    
  • The patroni-patroni-exporter container from the affected Pod is not ready. For example:

    kubectl get pod/patroni-13-2 -n stacklight -o jsonpath='{.status.containerStatuses[?(@.name=="patroni-patroni-exporter")].ready}'
    
    false
    

As a workaround, restart the patroni-patroni-exporter container of the affected Patroni Pod:

kubectl exec <affectedPatroniPodName> -n stacklight -c patroni-patroni-exporter -- kill 1

For example:

kubectl exec patroni-13-2 -n stacklight -c patroni-patroni-exporter -- kill 1

[27732-1] OpenSearch PVC size custom settings are dismissed during deployment

Fixed in 11.6.0 and 12.7.0

The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is overwritten by logging.persistentVolumeClaimSize during deployment of a Container Cloud cluster of any type and is set to the default 30Gi.

Note

This issue does not block the OpenSearch cluster operations if the default retention time is set. The default setting is usually enough for the capacity size of this cluster.

The issue may affect the following Cluster releases:

  • 11.2.0 - 11.5.0

  • 7.8.0 - 7.11.0

  • 8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)

  • 10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)

  • 13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)

To verify that the cluster is affected:

Note

In the commands below, substitute parameters enclosed in angle brackets to match the affected cluster values.

kubectl --kubeconfig=<managementClusterKubeconfigPath> \
-n <affectedClusterProjectName> \
get cluster <affectedClusterName> \
-o=jsonpath='{.spec.providerSpec.value.helmReleases[*].values.elasticsearch.persistentVolumeClaimSize}' | xargs echo config size:


kubectl --kubeconfig=<affectedClusterKubeconfigPath> \
-n stacklight get pvc -l 'app=opensearch-master' \
-o=jsonpath="{.items[*].status.capacity.storage}" | xargs echo capacity sizes:

  • The cluster is not affected if the configuration size value matches or is less than any capacity size. For example:

    config size: 30Gi
    capacity sizes: 30Gi 30Gi 30Gi
    
    config size: 50Gi
    capacity sizes: 100Gi 100Gi 100Gi
    
  • The cluster is affected if the configuration size is larger than any capacity size. For example:

    config size: 200Gi
    capacity sizes: 100Gi 100Gi 100Gi
    

Workaround for a new cluster creation:

  1. Select from the following options:

    • For a management or regional cluster, during the bootstrap procedure, open cluster.yaml.template for editing.

    • For a managed cluster, open the Cluster object for editing.

      Caution

      For a managed cluster, use the Container Cloud API instead of the web UI for cluster creation.

  2. In the opened .yaml file, add logging.persistentVolumeClaimSize along with elasticsearch.persistentVolumeClaimSize. For example:

    apiVersion: cluster.k8s.io/v1alpha1
    spec:
    ...
      providerSpec:
        value:
        ...
          helmReleases:
          - name: stacklight
            values:
              elasticsearch:
                persistentVolumeClaimSize: 100Gi
              logging:
                enabled: true
                persistentVolumeClaimSize: 100Gi
    
  3. Continue the cluster deployment. The system will use the custom value set in logging.persistentVolumeClaimSize.

    Caution

    If elasticsearch.persistentVolumeClaimSize is absent in the .yaml file, the Admission Controller blocks the configuration update.

Workaround for an existing cluster:

Caution

During the application of the below workarounds, a short outage of OpenSearch and its dependent components may occur with the following alerts firing on the cluster. This behavior is expected. Therefore, disregard these alerts.

StackLight alerts firing during cluster update, by cluster size and outage probability level:

  • Any cluster with high probability:

    • KubeStatefulSetOutage: statefulset=opensearch-master

    • KubeDeploymentOutage: deployment=opensearch-dashboards, deployment=metricbeat

  • Large cluster with average probability:

    • KubePodsNotReady (removed in 17.0.0, 16.0.0, and 14.1.0): created_by_name="opensearch-master*", created_by_name="opensearch-dashboards*", created_by_name="metricbeat-*"

    • OpenSearchClusterStatusWarning

    • OpenSearchNumberOfPendingTasks

    • OpenSearchNumberOfInitializingShards

    • OpenSearchNumberOfUnassignedShards (removed in 2.27.0 (17.2.0 and 16.2.0))

  • Any cluster with low probability:

    • KubeStatefulSetReplicasMismatch: statefulset=opensearch-master

    • KubeDeploymentReplicasMismatch: deployment=opensearch-dashboards, deployment=metricbeat

StackLight in HA mode with LVP provisioner for OpenSearch PVCs

Warning

After applying this workaround, the existing log data will be lost. Therefore, if required, migrate log data to a new persistent volume (PV).

  1. Move the existing log data to a new PV, if required.

  2. Increase the disk size for local volume provisioner (LVP).

  3. Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
    
    kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 0 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
    
  4. Recreate the opensearch-master StatefulSet with the updated disk size.

    kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
    
    kubectl -n stacklight delete statefulset opensearch-master
    
    kubectl create -f opensearch-master.yaml
    

    Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.

  5. Delete existing PVCs:

    kubectl delete pvc -l 'app=opensearch-master' -n stacklight
    

    Warning

    This command removes all existing logs data from PVCs.

  6. In the Cluster configuration, set the same logging.persistentVolumeClaimSize as the size of elasticsearch.persistentVolumeClaimSize. For example:

    apiVersion: cluster.k8s.io/v1alpha1
    kind: Cluster
    spec:
    ...
      providerSpec:
        value:
        ...
          helmReleases:
          - name: stacklight
            values:
              elasticsearch:
                persistentVolumeClaimSize: 100Gi
              logging:
                enabled: true
                persistentVolumeClaimSize: 100Gi
    
  7. Scale up the opensearch-master StatefulSet with dependent resources and enable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 3 statefulset opensearch-master
    
    sleep 100
    
    kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 1 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
    
StackLight in non-HA mode with an expandable StorageClass for OpenSearch PVCs

Note

To verify whether a StorageClass is expandable:

kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'

A positive system response is allowVolumeExpansion: true. A negative system response is blank or false.

  1. Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
    
    kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 0 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
    
  2. Recreate the opensearch-master StatefulSet with the updated disk size.

    kubectl -n stacklight get statefulset opensearch-master -o yaml | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
    
    kubectl -n stacklight delete statefulset opensearch-master
    
    kubectl create -f opensearch-master.yaml
    

    Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.

  3. Patch the PVCs with the new elasticsearch.persistentVolumeClaimSize value:

    kubectl -n stacklight patch pvc opensearch-master-opensearch-master-0 -p '{ "spec": { "resources": { "requests": { "storage": "<pvcSize>" }}}}'
    

    Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.

  4. In the Cluster configuration, set logging.persistentVolumeClaimSize the same as the size of elasticsearch.persistentVolumeClaimSize. For example:

     apiVersion: cluster.k8s.io/v1alpha1
     kind: Cluster
     spec:
     ...
       providerSpec:
         value:
         ...
           helmReleases:
           - name: stacklight
             values:
               elasticsearch:
                 persistentVolumeClaimSize: 100Gi
               logging:
                 enabled: true
                 persistentVolumeClaimSize: 100Gi
    
  5. Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
    
    sleep 100
    
    kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 1 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
    
StackLight in non-HA mode with a non-expandable StorageClass and no LVP for OpenSearch PVCs

Warning

After applying this workaround, the existing log data will be lost. Depending on your custom provisioner, you may find a third-party tool, such as pv-migrate, that allows copying all data from one PV to another.

If data loss is acceptable, proceed with the workaround below.

Note

To verify whether a StorageClass is expandable:

kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'

A positive system response is allowVolumeExpansion: true. A negative system response is blank or false.

  1. Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 0 statefulset opensearch-master
    
    kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 0 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'
    
  2. Recreate the opensearch-master StatefulSet with the updated disk size:

    kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <pvcSize>/g' > opensearch-master.yaml
    
    kubectl -n stacklight delete statefulset opensearch-master
    
    kubectl create -f opensearch-master.yaml
    

    Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.

  3. Delete existing PVCs:

    kubectl delete pvc -l 'app=opensearch-master' -n stacklight
    

    Warning

    This command removes all existing logs data from PVCs.

  4. In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as the size of the elasticsearch.persistentVolumeClaimSize parameter. For example:

     apiVersion: cluster.k8s.io/v1alpha1
     kind: Cluster
     spec:
     ...
       providerSpec:
         value:
         ...
           helmReleases:
           - name: stacklight
             values:
               elasticsearch:
                 persistentVolumeClaimSize: 100Gi
               logging:
                 enabled: true
                 persistentVolumeClaimSize: 100Gi
    
  5. Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:

    kubectl -n stacklight scale --replicas 1 statefulset opensearch-master
    
    sleep 100
    
    kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 1 deployment metricbeat
    
    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'
    

[27732-2] Custom settings for ‘elasticsearch.logstashRetentionTime’ are dismissed

Fixed in 11.6.0 and 12.7.0

Custom settings for the deprecated elasticsearch.logstashRetentionTime parameter are overwritten by the default setting set to 1 day.

The issue may affect the following Cluster releases with enabled elasticsearch.logstashRetentionTime:

  • 11.2.0 - 11.5.0

  • 7.8.0 - 7.11.0

  • 8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)

  • 10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)

  • 13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)

As a workaround, in the Cluster object, replace elasticsearch.logstashRetentionTime with elasticsearch.retentionTime that was implemented to replace the deprecated parameter. For example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
    ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            retentionTime:
              logstash: 10
              events: 10
              notifications: 10
          logging:
            enabled: true

For the StackLight configuration procedure and parameters description, refer to Configure StackLight.

[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.

On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.

The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created and get stuck during pod migration.

As a workaround, remove the stuck pods:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
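
To list candidate stuck pods, assuming that such pods are reported in the Failed phase, for example:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --field-selector status.phase=Failed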

Storage

[28783] Ceph condition stuck in absence of Ceph cluster secrets info

Fixed in 11.6.0 and 12.7.0

The Ceph condition gets stuck in the absence of the Ceph cluster secrets information. The observed behavior occurs on MOSK 22.3 clusters running on top of Container Cloud 2.21.

The list of symptoms includes:

  • The Cluster object contains the following condition:

    Failed to configure Ceph cluster: ceph cluster status info is not \
    updated at least for 5 minutes, ceph cluster secrets info is not available yet
    
  • The ceph-kcc-controller logs from the kaas namespace contain the following log lines:

    2022-11-30 19:39:17.393595 E | ceph-spec: failed to update cluster condition to \
    {Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully \
    LastHeartbeatTime:2022-11-30 19:39:17.378401993 +0000 UTC m=+2617.717554955 \
    LastTransitionTime:2022-05-16 16:14:37 +0000 UTC}. failed to update object \
    "rook-ceph/rook-ceph" status: Operation cannot be fulfilled on \
    cephclusters.ceph.rook.io "rook-ceph": the object has been modified; please \
    apply your changes to the latest version and try again
    

Workaround:

  1. Edit KaaSCephCluster of the affected managed cluster:

    kubectl -n <managedClusterProject> edit kaascephcluster
    

    Substitute <managedClusterProject> with the corresponding managed cluster namespace.

  2. Define the version parameter in the KaaSCephCluster spec:

    spec:
      cephClusterSpec:
        version: 15.2.13
    

    Note

    Starting from MOSK 22.4, the Ceph cluster version updates to 15.2.17. Therefore, remove the version parameter definition from KaaSCephCluster after the managed cluster update.

    Save the updated KaaSCephCluster spec.

  3. Find the MiraCeph custom resource definition (CRD) on the managed cluster and copy all annotations starting with meta.helm.sh:

    kubectl --kubeconfig <managedClusterKubeconfig> get crd miracephs.lcm.mirantis.com -o yaml
    

    Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.

    Example of a system output:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      annotations:
        controller-gen.kubebuilder.io/version: v0.6.0
        # save all annotations with "meta.helm.sh" somewhere
        meta.helm.sh/release-name: ceph-controller
        meta.helm.sh/release-namespace: ceph
    ...
    
  4. Create the miracephsecretscrd.yaml file and fill it with the following template:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      annotations:
        controller-gen.kubebuilder.io/version: v0.6.0
        <insert all "meta.helm.sh" annotations here>
      labels:
        app.kubernetes.io/managed-by: Helm
      name: miracephsecrets.lcm.mirantis.com
    spec:
      conversion:
        strategy: None
      group: lcm.mirantis.com
      names:
        kind: MiraCephSecret
        listKind: MiraCephSecretList
        plural: miracephsecrets
        singular: miracephsecret
      scope: Namespaced
      versions:
        - name: v1alpha1
          schema:
            openAPIV3Schema:
              description: MiraCephSecret aggregates secrets created by Ceph
              properties:
                apiVersion:
                  type: string
                kind:
                  type: string
                metadata:
                  type: object
                status:
                  properties:
                    lastSecretCheck:
                      type: string
                    lastSecretUpdate:
                      type: string
                    messages:
                      items:
                        type: string
                      type: array
                    state:
                      type: string
                  type: object
              type: object
          served: true
          storage: true
    

    Insert the copied meta.helm.sh annotations to the metadata.annotations section of the template.

  5. Apply miracephsecretscrd.yaml on the managed cluster:

    kubectl --kubeconfig <managedClusterKubeconfig> apply -f miracephsecretscrd.yaml
    

    Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.

  6. Obtain the MiraCeph name from the managed cluster:

    kubectl --kubeconfig <managedClusterKubeconfig> -n ceph-lcm-mirantis get miraceph -o name
    

    Substitute <managedClusterKubeconfig> with the corresponding managed cluster kubeconfig.

    Example of a system output:

    miraceph.lcm.mirantis.com/rook-ceph
    

    Copy the MiraCeph name after the slash, which is rook-ceph in the example above.

  7. Create the mcs.yaml file and fill it with the following template:

    apiVersion: lcm.mirantis.com/v1alpha1
    kind: MiraCephSecret
    metadata:
      name: <miracephName>
      namespace: ceph-lcm-mirantis
    status: {}
    

    Substitute <miracephName> with the MiraCeph name from the previous step.

  8. Apply mcs.yaml on the managed cluster:

    kubectl --kubeconfig <managedClusterKubeconfig> apply -f mcs.yaml
    

    Substitute <managedClusterKubeconfig> with a corresponding managed cluster kubeconfig.

After some delay, the cluster condition updates to the healthy state.
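
To verify that the condition has cleared, inspect the conditions of the Cluster object on the management cluster, for example (the cluster name is a placeholder):

kubectl -n <managedClusterProject> describe cluster <managedClusterName>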

[26441] Cluster update fails with the MountDevice failed for volume warning

Update of a bare metal based managed cluster with Ceph enabled fails with a PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

  1. Verify that the descriptions of the Pods that failed to run contain the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    

    In the command above, replace the following values:

    • <affectedProjectName> is the Container Cloud project name where the Pods failed to run

    • <affectedPodName> is a Pod name that failed to run in the specified project

    In the Pod description, identify the node name where the Pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the affected StatefulSet or Deployment that manages the failing Pod to 0 replicas.
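
    For example, for a StatefulSet (the resource name is a placeholder):

    kubectl -n <affectedProjectName> scale statefulset <affectedStatefulSetName> --replicas 0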

  4. On every csi-rbdplugin Pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected Pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
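
    For example (the resource name and the original replica count are placeholders):

    kubectl -n <affectedProjectName> scale statefulset <affectedStatefulSetName> --replicas <originalReplicaCount>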