Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.27.0, including the Cluster releases 17.2.0 and 16.2.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines known issues from previous Container Cloud releases that are still valid.

Bare metal

[47202] Inspection error on bare metal hosts after dnsmasq restart

If the dnsmasq pod is restarted during the bootstrap of newly added nodes, those nodes may fail to undergo inspection. This can result in the inspection error status in the corresponding BareMetalHost objects.

The issue can occur when:

  • The dnsmasq pod was moved to another node.

  • DHCP subnets were changed, including addition or removal. In this case, the dhcpd container of the dnsmasq pod is restarted.

    Caution

    If you need to change or add DHCP subnets to bootstrap new nodes, make the change first, wait until the dnsmasq pod becomes ready, and only then create the BareMetalHost objects.

To verify whether the nodes are affected:

  1. Verify whether the BareMetalHost objects contain the inspection error:

    kubectl get bmh -n <managed-cluster-namespace-name>
    

    Example of system response:

    NAME            STATE         CONSUMER        ONLINE   ERROR              AGE
    test-master-1   provisioned   test-master-1   true                        9d
    test-master-2   provisioned   test-master-2   true                        9d
    test-master-3   provisioned   test-master-3   true                        9d
    test-worker-1   provisioned   test-worker-1   true                        9d
    test-worker-2   provisioned   test-worker-2   true                        9d
    test-worker-3   inspecting                    true     inspection error   19h
    
  2. Verify whether the dnsmasq pod was in the Ready state when the inspection of the affected bare metal hosts (test-worker-3 in the example above) started:

    kubectl -n kaas get pod <dnsmasq-pod-name> -oyaml
    

    Example of system response:

    ...
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: PodScheduled
      containerStatuses:
      - containerID: containerd://6dbcf2fc4b36ce4c549c9191ab01f72d0236c51d42947675302675e4bfaf4cdf
        image: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq:base-2-28-alpine-20240812132650
        imageID: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq@sha256:3dad3e278add18e69b2608e462691c4823942641a0f0e25e6811e703e3c23b3b
        lastState:
          terminated:
            containerID: containerd://816fcf079cd544acd74e312065de5b5ed4dbf1dc6159fefffff4f644b5e45987
            exitCode: 0
            finishedAt: "2024-10-11T07:38:35Z"
            reason: Completed
            startedAt: "2024-10-10T15:37:45Z"
        name: dhcpd
        ready: true
        restartCount: 2
        started: true
        state:
          running:
            startedAt: "2024-10-11T07:38:37Z"
      ...
    

    In the system response above, the dhcpd container was not ready between "2024-10-11T07:38:35Z" and "2024-10-11T07:38:54Z".

  3. Verify the affected bare metal host. For example:

    kubectl get bmh -n managed-ns test-worker-3 -oyaml
    

    Example of system response:

    ...
    status:
      errorCount: 15
      errorMessage: Introspection timeout
      errorType: inspection error
      ...
      operationHistory:
        deprovision:
          end: null
          start: null
        inspect:
          end: null
          start: "2024-10-11T07:38:19Z"
        provision:
          end: null
          start: null
        register:
          end: "2024-10-11T07:38:19Z"
          start: "2024-10-11T07:37:25Z"
    

    In the system response above, inspection was started at "2024-10-11T07:38:19Z", immediately before the period of the dhcpd container downtime. Therefore, this node is most likely affected by the issue.
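
The comparison made in steps 2 and 3 can be sketched as a small shell helper. This is a minimal illustration and not part of Container Cloud: the function name inspection_overlaps_downtime and its argument order are hypothetical, and the sketch assumes that all timestamps are UTC values in the same ISO 8601 format as in the responses above, so that string order matches chronological order.

```shell
# Hypothetical helper: succeeds (exit status 0) if the inspection was still
# running at some point during the dhcpd downtime window.
# Arguments: <inspect-start> <inspect-end or null> <downtime-start> <downtime-end>
inspection_overlaps_downtime() {
  awk -v istart="$1" -v iend="$2" -v dstart="$3" -v dend="$4" 'BEGIN {
    # An unfinished inspection (end: null) is treated as still running.
    if (iend == "" || iend == "null") iend = "9999"
    # Appending "" forces string comparison of the ISO 8601 timestamps.
    exit !((istart "" <= dend "") && (iend "" >= dstart ""))
  }'
}

# Values taken from the example responses above
inspection_overlaps_downtime "2024-10-11T07:38:19Z" null \
  "2024-10-11T07:38:35Z" "2024-10-11T07:38:54Z" && echo "likely affected"
```

An overlap only suggests that the node is affected. Confirm it against the errorMessage and errorType fields of the BareMetalHost object.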

To work around the issue, remove the failed BareMetalHost object and create it again:

  1. Remove the BareMetalHost object. For example:

    kubectl delete bmh -n managed-ns test-worker-3
    
  2. Verify that the BareMetalHost object is removed:

    kubectl get bmh -n managed-ns test-worker-3
    
  3. Create a BareMetalHost object from the template. For example:

    kubectl create -f bmhc-test-worker-3.yaml
    kubectl create -f bmh-test-worker-3.yaml
    

[46245] Lack of access permissions for HOC and HOCM objects

Fixed in 2.28.0 (17.3.0 and 16.3.0)

When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with the global-admin or operator role obtains the access denied error. For example:

kubectl --kubeconfig ~/.kube/mgmt-config get hocm

Error from server (Forbidden): hostosconfigurationmodules.kaas.mirantis.com is forbidden:
User "2d74348b-5669-4c65-af31-6c05dbedac5f" cannot list resource "hostosconfigurationmodules"
in API group "kaas.mirantis.com" at the cluster scope: access denied

Workaround:

  1. Modify the global-admin role by adding a new entry with the following contents to the rules list:

    kubectl edit clusterroles kaas-global-admin
    
    - apiGroups: [kaas.mirantis.com]
      resources: [hostosconfigurationmodules]
      verbs: ['*']
    
  2. For each Container Cloud project, modify the kaas-operator role by adding a new entry with the following contents to the rules list:

    kubectl -n <projectName> edit roles kaas-operator
    
    - apiGroups: [kaas.mirantis.com]
      resources: [hostosconfigurations]
      verbs: ['*']
    

[41305] DHCP responses are lost between dnsmasq and dhcp-relay pods

Fixed in 2.28.0 (17.3.0 and 16.3.0)

After node maintenance of a management cluster, the newly added nodes may fail to undergo provisioning successfully. The issue relates to new nodes that are in the same L2 domain as the management cluster.

The issue was observed on environments having management cluster nodes configured with a single L2 segment used for all network traffic (PXE and LCM/management networks).

To verify whether the cluster is affected:

Verify whether the dnsmasq and dhcp-relay pods run on the same node in the management cluster:

kubectl -n kaas get pods -o wide | grep -e "dhcp\|dnsmasq"

Example of system response:

dhcp-relay-7d85f75f76-5vdw2   2/2   Running   2 (36h ago)   36h   10.10.0.122     kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>
dnsmasq-8f4b484b4-slhbd       5/5   Running   1 (36h ago)   36h   10.233.123.75   kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>

If this is the case, proceed to the workaround below.
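
The same check can be scripted instead of comparing the NODE column by eye. The following is a minimal sketch: the function name dhcp_pods_share_node is hypothetical, and it assumes that the last two columns of the listing (NOMINATED NODE and READINESS GATES) each contain a single value, as in the example response above.

```shell
# Hypothetical helper: reads `kubectl -n kaas get pods -o wide` output on stdin.
# Exit status: 0 = both pods run on the same node (cluster is affected),
#              1 = pods run on different nodes,
#              2 = one of the pods was not found in the input.
dhcp_pods_share_node() {
  awk '
    # RESTARTS may span several fields, for example "2 (36h ago)", so the
    # NODE column is addressed from the end of the line.
    $1 ~ /^dhcp-relay-/ { relay = $(NF - 2) }
    $1 ~ /^dnsmasq-/    { dnsmasq = $(NF - 2) }
    END {
      if (relay == "" || dnsmasq == "") exit 2
      exit ((relay == dnsmasq) ? 0 : 1)
    }'
}

# Usage (requires access to the management cluster):
#   kubectl -n kaas get pods -o wide | dhcp_pods_share_node && echo "affected"
```

Counting fields from the end of the line keeps the check independent of how many restarts the pods report.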

Workaround:

  1. Log in to a node that contains kubeconfig of the affected management cluster.

  2. Make sure that at least two management cluster nodes are schedulable:

    kubectl get node
    

    Example of a positive system response:

    NAME                                             STATUS   ROLES    AGE   VERSION
    kaas-node-bcedb87b-b3ce-46a4-a4ca-ea3068689e40   Ready    master   37h   v1.27.10-mirantis-1
    kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   Ready    master   37h   v1.27.10-mirantis-1
    kaas-node-ad5a6f51-b98f-43c3-91d5-55fed3d0ff21   Ready    master   37h   v1.27.10-mirantis-1
    
  3. Delete the dhcp-relay pod:

    kubectl -n kaas delete pod <dhcp-relay-xxxxx>
    
  4. Verify that the dnsmasq and dhcp-relay pods are scheduled on different nodes:

    kubectl -n kaas get pods -o wide | grep -e "dhcp\|dnsmasq"
    

    Example of a positive system response:

    dhcp-relay-7d85f75f76-rkv03   2/2   Running   0             49s   10.10.0.121     kaas-node-bcedb87b-b3ce-46a4-a4ca-ea3068689e40   <none>   <none>
    dnsmasq-8f4b484b4-slhbd       5/5   Running   1 (37h ago)   37h   10.233.123.75   kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>
    

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.


LCM

[39437] Failure to replace a master node on a Container Cloud cluster

During the replacement of a master node on a cluster of any type, the process may get stuck with the "Kubelet's NodeReady condition is Unknown" message in the machine status on the remaining master nodes.

As a workaround, log in to the affected node and run the following command:

docker restart ucp-kubelet

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.

[30294] Replacement of a master node is stuck on the calico-node Pod start

During replacement of a master node on a cluster of any type, the calico-node Pod fails to start on a new node that has the same IP address as the node being replaced.

Workaround:

  1. Log in to any master node.

  2. From a CLI with an MKE client bundle, create a shell alias to start calicoctl using the mirantis/ucp-dsinfo image:

    alias calicoctl="\
    docker run -i --rm \
    --pid host \
    --net host \
    -e constraint:ostype==linux \
    -e ETCD_ENDPOINTS=<etcdEndpoint> \
    -e ETCD_KEY_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/key.pem \
    -e ETCD_CA_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/ca.pem \
    -e ETCD_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/cert.pem \
    -v /var/run/calico:/var/run/calico \
    -v /var/lib/docker/volumes/ucp-kv-certs/_data:/var/lib/docker/volumes/ucp-kv-certs/_data:ro \
    mirantis/ucp-dsinfo:<mkeVersion> \
    calicoctl \
    "
    
    alias calicoctl="\
    docker run -i --rm \
    --pid host \
    --net host \
    -e constraint:ostype==linux \
    -e ETCD_ENDPOINTS=<etcdEndpoint> \
    -e ETCD_KEY_FILE=/ucp-node-certs/key.pem \
    -e ETCD_CA_CERT_FILE=/ucp-node-certs/ca.pem \
    -e ETCD_CERT_FILE=/ucp-node-certs/cert.pem \
    -v /var/run/calico:/var/run/calico \
    -v ucp-node-certs:/ucp-node-certs:ro \
    mirantis/ucp-dsinfo:<mkeVersion> \
    calicoctl --allow-version-mismatch \
    "
    

    In the above command, replace the following values with the corresponding settings of the affected cluster:

    • <etcdEndpoint> is the etcd endpoint defined in the Calico configuration file. For example, ETCD_ENDPOINTS=127.0.0.1:12378

    • <mkeVersion> is the MKE version installed on your cluster. For example, mirantis/ucp-dsinfo:3.5.7.

  3. Verify the node list on the cluster:

    kubectl get node
    
  4. Compare this list with the node list in Calico to identify the old node:

    calicoctl get node -o wide
    
  5. Remove the old node from Calico:

    calicoctl delete node kaas-node-<nodeID>
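
The comparison in steps 3 and 4 can be sketched as a set difference between the two listings. This is an illustration only: the function name stale_calico_nodes and the temporary file names are hypothetical, and the sketch assumes that both listings start with a single header line and carry the node name in the first column.

```shell
# Hypothetical helper: prints node names that are present in the Calico
# listing but absent from the Kubernetes listing.
# Arguments: <file with `kubectl get node` output>
#            <file with `calicoctl get node -o wide` output>
stale_calico_nodes() {
  awk '
    FNR == 1             { next }                # skip the header of each listing
    FILENAME == ARGV[1]  { known[$1] = 1; next } # node names known to Kubernetes
    NF && !($1 in known) { print $1 }            # node names known only to Calico
  ' "$1" "$2"
}

# Usage:
#   kubectl get node > /tmp/k8s-nodes.txt
#   calicoctl get node -o wide > /tmp/calico-nodes.txt
#   stale_calico_nodes /tmp/k8s-nodes.txt /tmp/calico-nodes.txt
```

Review the printed names before deleting anything from Calico. The helper only reports differences and does not remove nodes.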
    

[5782] Manager machine fails to be deployed during node replacement

During replacement of a manager machine, the following problems may occur:

  • The system adds the node to Docker swarm but not to Kubernetes

  • The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

  1. Delete the failed node.

  2. Wait for the MKE cluster to become healthy. To monitor the cluster status:

    1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.

    2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.

  3. Deploy a new node.

[5568] The calico-kube-controllers Pod fails to clean up resources

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

  • The calico-kube-controllers Pod fails to clean up resources associated with the deleted node

  • The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>

Ceph

[42908] The ceph-exporter pods are present in the Ceph crash list

After a managed cluster update, the ceph-exporter pods are present in the ceph crash ls list while rook-ceph-exporter attempts to obtain the port that is still in use. The issue does not block the managed cluster update. Once the port becomes available, rook-ceph-exporter obtains the port and the issue disappears.

As a workaround, run ceph crash archive-all to remove ceph-exporter pods from the Ceph crash list.

[26441] Cluster update fails with the MountDevice failed for volume warning

Update of a bare metal-based managed cluster with Ceph enabled fails with a PersistentVolumeClaim for the prometheus-server StatefulSet getting stuck in the Pending state and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

  1. Verify that the description of the Pods that failed to run contains the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    

    In the command above, replace the following values:

    • <affectedProjectName> is the Container Cloud project name where the Pods failed to run

    • <affectedPodName> is a Pod name that failed to run in the specified project

    In the Pod description, identify the node name where the Pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the StatefulSet or Deployment of the failing Pod to 0 replicas.

  4. On every csi-rbdplugin Pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected Pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.


StackLight

[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state

On High Availability (HA) clusters that use Local Volume Provisioner (LVP), Prometheus and OpenSearch from StackLight may share the same pool of storage. In such a configuration, OpenSearch may approach the 85% disk usage watermark due to the combined storage allocation and usage patterns set by the Persistent Volume Claim (PVC) size parameters for Prometheus and OpenSearch, the two components that consume the most storage.

When the 85% threshold is reached, the affected node is transitioned to the read-only state, preventing shard allocation and causing the OpenSearch cluster state to transition to Warning (Yellow) or Critical (Red).

Caution

The issue and the provided workaround apply only to clusters on which OpenSearch and Prometheus utilize the same storage pool.

To verify that the cluster is affected:

  1. Verify the result of the following formula:

    0.8 × OpenSearch_PVC_Size_GB + Prometheus_PVC_Size_GB > 0.85 × Total_Storage_Capacity_GB
    

    In the formula, define the following values:

    OpenSearch_PVC_Size_GB

    Derived from .Values.elasticsearch.persistentVolumeUsableStorageSizeGB, defaulting to .Values.elasticsearch.persistentVolumeClaimSize if unspecified.

    Prometheus_PVC_Size_GB

    Sourced from .Values.prometheusServer.persistentVolumeClaimSize.

    Total_Storage_Capacity_GB

    Total capacity of the OpenSearch PVCs. For LVP, the capacity of the storage pool. To obtain the total capacity:

    kubectl get pvc -n stacklight -l app=opensearch-master \
    -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
    

    The system response contains multiple outputs, one per opensearch-master node. Select the capacity for the affected node.

    Note

    Convert the values to GB if they are set in different units.

    If the inequality holds, it is an early indication that the cluster is affected.

  2. Verify whether the OpenSearchClusterStatusWarning or OpenSearchClusterStatusCritical alert is firing. If so, verify the following:

    1. Log in to the OpenSearch web UI.

    2. In Management -> Dev Tools, run the following command:

      GET _cluster/allocation/explain
      

      The following system response indicates that the corresponding node is affected:

      "explanation": "the node is above the low watermark cluster setting \
      [cluster.routing.allocation.disk.watermark.low=85%], using more disk space \
      than the maximum allowed [85.0%], actual free: [xx.xxx%]"
      

      Note

      Depending on the case, the system response may report a watermark percentage higher than 85.0%.
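
The formula from step 1 of the verification procedure can be evaluated with a short shell helper. This is a minimal sketch: the function name opensearch_storage_at_risk is hypothetical, the sizes in the usage line are made up for illustration, and all three arguments are assumed to be already converted to GB.

```shell
# Hypothetical helper: succeeds (exit status 0) if
#   0.8 * OpenSearch_PVC_Size_GB + Prometheus_PVC_Size_GB
#     > 0.85 * Total_Storage_Capacity_GB
# Arguments, all in GB: <OpenSearch_PVC_Size_GB> <Prometheus_PVC_Size_GB>
#                       <Total_Storage_Capacity_GB>
opensearch_storage_at_risk() {
  awk -v os="$1" -v prom="$2" -v total="$3" 'BEGIN {
    exit !((0.8 * os + prom) > (0.85 * total))
  }'
}

# Made-up sizes: 0.8 * 100 + 50 = 130 exceeds 0.85 * 150 = 127.5
opensearch_storage_at_risk 100 50 150 && echo "cluster is likely affected"
```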

Workaround:

Warning

The workaround implies adjustment of the retention threshold for OpenSearch. Depending on the new threshold, some old logs will be deleted.

  1. Adjust or set .Values.elasticsearch.persistentVolumeUsableStorageSizeGB to a lower value so that the inequality from the verification procedure above no longer holds. For configuration details, see Operations Guide: StackLight configuration - OpenSearch parameters.

    Mirantis also recommends reserving some space for other PVCs using storage from the pool. Use the following formula to calculate the required space:

    persistentVolumeUsableStorageSizeGB =
    0.84 × ((1 - Reserved_Percentage - Filesystem_Reserve) ×
    Total_Storage_Capacity_GB - Prometheus_PVC_Size_GB) /
    0.8
    

    In the formula, define the following values:

    Reserved_Percentage

    A user-defined variable that specifies what percentage of the total storage capacity should not be used by OpenSearch or Prometheus. This is used to reserve space for other components. It should be expressed as a decimal. For example, for 5% of reservation, Reserved_Percentage is 0.05. Mirantis recommends using 0.05 as a starting point.

    Filesystem_Reserve

    Percentage to deduct for filesystems that may reserve some portion of the available storage, which is marked as occupied. For example, for EXT4, it is 5% by default, so the value must be 0.05.

    Prometheus_PVC_Size_GB

    Sourced from .Values.prometheusServer.persistentVolumeClaimSize.

    Total_Storage_Capacity_GB

    Total capacity of the OpenSearch PVCs. For LVP, the capacity of the storage pool. To obtain the total capacity:

    kubectl get pvc -n stacklight -l app=opensearch-master \
    -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
    

    The system response contains multiple outputs, one per opensearch-master node. Select the capacity for the affected node.

    Note

    Convert the values to GB if they are set in different units.

    The result of the formula above is the maximum safe storage size to allocate for .Values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this formula as a reference when setting this parameter on a cluster.

  2. Wait up to 15-20 minutes for OpenSearch to perform the cleanup.

  3. Verify that the cluster is not affected anymore using the procedure above.
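
The sizing formula from step 1 of the workaround can be computed as follows. This is a minimal sketch: the function name usable_storage_size_gb is hypothetical, the pool and PVC sizes in the usage line are made up, and both reserve values default to 0.05, the starting point mentioned above for Reserved_Percentage and the EXT4 default for Filesystem_Reserve.

```shell
# Hypothetical helper: prints the upper bound for
# persistentVolumeUsableStorageSizeGB, in GB.
# Arguments: <Total_Storage_Capacity_GB> <Prometheus_PVC_Size_GB>
#            [Reserved_Percentage, default 0.05]
#            [Filesystem_Reserve, default 0.05]
usable_storage_size_gb() {
  awk -v total="$1" -v prom="$2" \
      -v reserved="${3:-0.05}" -v fsres="${4:-0.05}" 'BEGIN {
    printf "%.2f\n", 0.84 * ((1 - reserved - fsres) * total - prom) / 0.8
  }'
}

# Made-up sizes: a 500 GB storage pool and a 50 GB Prometheus PVC
usable_storage_size_gb 500 50   # prints 420.00
```

With these sample numbers, 0.8 × 420 + 50 = 386 stays below 0.85 × 500 = 425, so the affection check is no longer triggered.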

[43164] Rollover policy is not added to indices created without a policy

Fixed in 2.28.0 (17.3.0 and 16.3.0)

The initial index for the system* and audit* data streams can be created without any policy attached due to a race condition.

One of the indicators that the cluster is most likely affected is the KubeJobFailed alert firing for the elasticsearch-curator job and one or both of the following errors being present in the elasticsearch-curator pods that remain in the Error status:

2024-05-31 13:16:04,459 ERROR   Failed to complete action: delete_indices.  \
<class 'curator.exceptions.FailedExecution'>: Exception encountered.  \
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. \
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] \
is the write index for data stream [system] and cannot be deleted')

or

2024-05-31 13:16:04,459 ERROR   Failed to complete action: delete_indices.  \
<class 'curator.exceptions.FailedExecution'>: Exception encountered.  \
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. \
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] \
is the write index for data stream [audit] and cannot be deleted')

If the above-mentioned alert and errors are present, immediate action is required because they indicate that the corresponding index size has already exceeded the space allocated for the index.

To verify that the cluster is affected:

Caution

Verify and apply the workaround to both index patterns, system and audit, separately.

If one of the indices is affected, the second one is most likely affected as well, although in rare cases only one index may be affected.

  1. Log in to the opensearch-master-0 Pod:

    kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash
    
  2. Verify whether the rollover policy is attached to the index with the 000001 number:

    • system:

      curl localhost:9200/_plugins/_ism/explain/.ds-system-000001
      
    • audit:

      curl localhost:9200/_plugins/_ism/explain/.ds-audit-000001
      

    If the rollover policy is not attached, the cluster is affected. Examples of system responses in an affected cluster:

    {
      ".ds-system-000001": {
        "index.plugins.index_state_management.policy_id": null,
        "index.opendistro.index_state_management.policy_id": null,
        "enabled": null
      },
      "total_managed_indices": 0
    }
    
    {
      ".ds-audit-000001": {
        "index.plugins.index_state_management.policy_id": null,
        "index.opendistro.index_state_management.policy_id": null,
        "enabled": null
      },
      "total_managed_indices": 0
    }
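
The check of the explain response can also be scripted. The following is a minimal sketch: the function name ism_policy_missing is hypothetical, and it assumes that grep is available in the environment where the command runs and that the response uses the field name shown in the examples above.

```shell
# Hypothetical helper: reads an ISM explain response on stdin and succeeds
# (exit status 0) if the policy_id field is null, that is, if no policy is
# attached to the index.
ism_policy_missing() {
  grep -Eq '"index\.plugins\.index_state_management\.policy_id"[[:space:]]*:[[:space:]]*null'
}

# Usage inside the opensearch-master-0 Pod:
#   curl -s localhost:9200/_plugins/_ism/explain/.ds-system-000001 \
#     | ism_policy_missing && echo "system: rollover policy is not attached"
```

Run the check once per index pattern, because the system and audit indices can be affected independently.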
    

Workaround:

  1. Log in to the opensearch-master-0 Pod:

    kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash
    
  2. Add the policy:

    • system:

      curl -XPOST -H "Content-type: application/json" localhost:9200/_plugins/_ism/add/system* -d'{"policy_id":"system_rollover_policy"}'
      
    • audit:

      curl -XPOST -H "Content-type: application/json" localhost:9200/_plugins/_ism/add/audit* -d'{"policy_id":"audit_rollover_policy"}'
      
  3. Repeat the last step of the cluster verification procedure provided above and make sure that the policy is attached to the index.