Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.23.0, including the Cluster release 11.7.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines known issues from previous Container Cloud releases that are still valid.


Bare metal

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node:
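
kubectl cordon <nodeName>
kubectl drain <nodeName>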

[20736] Region deletion failure after regional deployment failure

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following object types that contain the kaas.mirantis.com/region label of the affected region:

  • cluster

  • machine

  • baremetalhost

  • baremetalhostprofile

  • l2template

  • subnet

  • ipamhost

  • ipaddr

kubectl delete <objectType> -l kaas.mirantis.com/region=<regionName>
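
For example, to clean up all of the listed object types in one pass (a convenience sketch; extend the list of kinds if needed and substitute the affected region name):

for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete $kind -l kaas.mirantis.com/region=<regionName>
done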

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.



LCM

[5782] Manager machine fails to be deployed during node replacement

During replacement of a manager machine, the following problems may occur:

  • The system adds the node to Docker swarm but not to Kubernetes

  • The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

  1. Delete the failed node.

  2. Wait for the MKE cluster to become healthy. To monitor the cluster status:

    1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.

    2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.

  3. Deploy a new node.
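
After the new node is deployed, you can verify that it has joined both Docker Swarm and Kubernetes, for example:

docker node ls
kubectl get nodes

In the docker node ls output, run on a manager node, the new node must have the Ready status. In the kubectl get nodes output, the new node must appear and eventually become Ready.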

[5568] The ‘calico-kube-controllers’ Pod fails to clean up resources

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

  • The calico-kube-controllers Pod fails to clean up resources associated with the deleted node

  • The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>
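
After a replacement machine is deployed, you can also verify that calico-node started successfully on the new node, for example (assuming the standard k8s-app=calico-node Pod label):

kubectl -n kube-system get pods -l k8s-app=calico-node -o wide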

[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update

During update of a Container Cloud cluster of any type, if the MKE minor version is updated from 3.4.x to 3.5.x, access to the cluster using the existing kubeconfig fails with the You must be logged in to the server (Unauthorized) error due to OIDC settings being reconfigured.

As a workaround, during the cluster update process, use the admin kubeconfig instead of the existing one. Once the update completes, you can use the existing cluster kubeconfig again.

To obtain the admin kubeconfig:

kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml

If the related cluster is regional, replace <pathToMgmtKubeconfig> with <pathToRegionalKubeconfig>.
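
Once extracted, you can verify that the admin kubeconfig works, for example:

kubectl --kubeconfig clusterKubeconfig.yaml get nodes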


TLS configuration

[29604] The ‘failed to get kubeconfig’ error during TLS configuration

When setting a new Transport Layer Security (TLS) certificate for a cluster, the false positive failed to get kubeconfig error may occur on the Waiting for TLS settings to be applied stage. No action is required; disregard the error.

To verify the status of the TLS configuration being applied:

kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.<Application>}"

Possible values for the <Application> parameter are as follows:

  • keycloak

  • ui

  • cache

  • mke

  • iamProxyAlerta

  • iamProxyAlertManager

  • iamProxyGrafana

  • iamProxyKibana

  • iamProxyPrometheus
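
For example, to check the status of the MKE certificate configuration:

kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.mke}"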

Example of system response:

[
    {
        "expirationTime": "2024-01-06T09:37:04Z",
        "hostname": "domain.com"
    }
]

In this example, expirationTime equals the NotAfter field of the server certificate, and hostname contains the host name configured for the application.


Ceph

[26441] Cluster update fails with the ‘MountDevice failed for volume’ warning

Update of a managed cluster based on Equinix Metal with private networking and Ceph enabled fails with a PersistentVolumeClaim stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

  1. Verify that the description of each Pod that failed to run contains FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    

    In the command above, replace the following values:

    • <affectedProjectName> is the Container Cloud project name where the Pods failed to run

    • <affectedPodName> is a Pod name that failed to run in the specified project

    In the Pod description, identify the node name where the Pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error, where <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the StatefulSet or Deployment of the affected Pod to 0 replicas, as shown in the example after this procedure.

  4. On every csi-rbdplugin Pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete the VolumeAttachment of the affected Pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
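
The scale operations in steps 3 and 7 can be performed with kubectl scale. For example, for the prometheus-server StatefulSet mentioned above, assuming it runs in the stacklight namespace with one replica (substitute the actual namespace, workload kind, name, and replica count):

kubectl -n stacklight scale statefulset prometheus-server --replicas 0
# perform steps 4-6, then scale back up:
kubectl -n stacklight scale statefulset prometheus-server --replicas 1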

[30635] Ceph ‘pg_autoscaler’ is stuck with the ‘overlapping roots’ error

Due to an upstream Ceph issue present since Ceph Pacific, the pg_autoscaler module of Ceph Manager fails with the pool <poolNumber> has overlapping roots error if a Ceph cluster contains a mix of pools with and without an explicitly specified deviceClass.

The deviceClass parameter is required for a pool definition in the spec section of the KaaSCephCluster object, but is not required for Ceph RADOS Gateway (RGW) and Ceph File System (CephFS). Therefore, if the Ceph RGW or CephFS data or metadata pool sections are defined without deviceClass, autoscaling of placement groups is disabled on the cluster due to overlapping roots. Overlapping roots imply that the Ceph RGW and/or CephFS pools obtained the default crush rule and have no demarcation on a specific device class to store data.

Note

If pools for Ceph RGW and CephFS already have deviceClass specified, skip the corresponding steps of the below procedure.

Note

Perform the below procedure on the affected managed cluster using its kubeconfig.

Workaround:

  1. Obtain failureDomain and required replicas for Ceph RGW and/or CephFS pools:

    Note

    If the KaaSCephCluster spec section does not contain failureDomain, failureDomain equals host by default to store one replica per node.

    Note

    The types of pool crush rules are as follows:

    • An erasureCoded pool requires the codingChunks + dataChunks number of available units of failureDomain.

    • A replicated pool requires the replicated.size number of available units of failureDomain.

    • To obtain Ceph RGW pools, use the spec.cephClusterSpec.objectStorage.rgw section of the KaaSCephCluster object. For example:

      objectStorage:
        rgw:
          dataPool:
            failureDomain: host
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
          metadataPool:
            failureDomain: host
            replicated:
              size: 3
          gateway:
            allNodes: false
            instances: 3
            port: 80
            securePort: 8443
          name: openstack-store
          preservePoolsOnDelete: false
      

      The dataPool pool requires the sum of codingChunks and dataChunks values representing the number of available units of failureDomain. In the example above, for failureDomain: host, dataPool requires 3 available nodes to store its objects.

      The metadataPool pool requires the replicated.size number of available units of failureDomain. For failureDomain: host, metadataPool requires 3 available nodes to store its objects.

    • To obtain CephFS pools, use the spec.cephClusterSpec.sharedFilesystem.cephFS section of the KaaSCephCluster object. For example:

      sharedFilesystem:
        cephFS:
        - name: cephfs-store
          dataPools:
          - name: default-pool
            replicated:
              size: 3
            failureDomain: host
          - name: second-pool
            erasureCoded:
              dataChunks: 2
              codingChunks: 1
          metadataPool:
            replicated:
              size: 3
            failureDomain: host
          ...
      

      The default-pool and metadataPool pools require the replicated.size number of available units of failureDomain. For failureDomain: host, default-pool requires 3 available nodes to store its objects.

      The second-pool pool requires the sum of codingChunks and dataChunks representing the number of available units of failureDomain. For failureDomain: host, second-pool requires 3 available nodes to store its objects.

  2. Obtain the device class that meets the desired number of required replicas for the defined failureDomain.

  3. Calculate potential data size for Ceph RGW and CephFS pools.

  4. Create the rule-helper script to switch Ceph RGW or CephFS pools to device class usage.

  5. For Ceph RGW, execute the rule-helper script to output step-by-step instructions and manually run each step provided in the output.

    Note

    The following steps include creation of crush rules with the same parameters as before but with the device class specification and switching of pools to new crush rules.

  6. For CephFS, execute the rule-helper script to output step-by-step instructions and manually run each step provided in the output.

  7. Verify the pg_autoscaler module after switching deviceClass for all required pools:

    ceph osd pool autoscale-status
    

    The system response must contain all Ceph RGW and CephFS pools.

  8. On the management cluster, edit the KaaSCephCluster object of the corresponding managed cluster by adding the selected device class to the deviceClass parameter of the updated Ceph RGW and CephFS pools:

    kubectl -n <managedClusterProjectName> edit kaascephcluster
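
    For example, the Ceph RGW pool definitions from step 1 with an explicit device class added, where hdd is an illustrative value and must be replaced with the device class selected in step 2:

    objectStorage:
      rgw:
        dataPool:
          deviceClass: hdd
          failureDomain: host
          erasureCoded:
            codingChunks: 1
            dataChunks: 2
        metadataPool:
          deviceClass: hdd
          failureDomain: host
          replicated:
            size: 3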
    

    You can use this configuration step for further management of Ceph RGW and/or CephFS. It does not impact the existing Ceph cluster configuration.