Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.22.0 including the Cluster release 11.6.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines the known issues from previous Container Cloud releases that are still valid.


Bare metal

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
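
For example, for the node that runs the ironic Pod:

kubectl cordon <nodeName>
kubectl drain <nodeName>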

[20736] Region deletion failure after regional deployment failure

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

  • cluster

  • machine

  • baremetalhost

  • baremetalhostprofile

  • l2template

  • subnet

  • ipamhost

  • ipaddr

kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
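
For example, to remove all of the listed object types in one pass, a possible sketch (extend the list if other objects in your deployment carry the region label):

for kind in cluster machine baremetalhost baremetalhostprofile l2template subnet ipamhost ipaddr; do
  kubectl delete $kind -l kaas.mirantis.com/region=<regionName>
done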

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.



Equinix Metal with private networking

[29296] Deployment of a managed cluster fails during provisioning

Deployment of a managed cluster based on Equinix Metal with private networking fails during provisioning with the following error:

InspectionError: Failed to obtain hardware details.
Ensure DHCP relay is up and running

Workaround:

  1. In deployment/dnsmasq, update the image tag version for the dhcpd container to base-alpine-20230118150429:

    kubectl -n kaas edit deployment/dnsmasq
    
  2. In dnsmasq.conf, replace the default undionly.kpxe boot option with ipxe.pxe:

    kubectl -n kaas edit cm dnsmasq-config
    

    Example of existing configuration:

    dhcp-boot=/undionly.kpxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local
    

    Example of new configuration:

    dhcp-boot=/ipxe.pxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local
    

vSphere

[29647] The ‘Network prepared’ stage of cluster deployment never succeeds

Fixed in 11.7.0

During deployment of a vSphere-based management or regional cluster with IPAM disabled, the Network prepared stage gets stuck in the NotStarted status. The issue does not affect cluster deployment. Therefore, disregard the error message.


LCM

[5782] Manager machine fails to be deployed during node replacement

Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)

During replacement of a manager machine, the following problems may occur:

  • The system adds the node to Docker swarm but not to Kubernetes

  • The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

  1. Delete the failed node.

  2. Wait for the MKE cluster to become healthy. To monitor the cluster status:

    1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.

    2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.

  3. Deploy a new node.
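
To verify that the new node was added both to Docker Swarm and to Kubernetes, you can compare the node lists, for example:

# On an MKE manager node, list the Docker Swarm members
docker node ls

# Using the cluster kubeconfig, list the Kubernetes nodes
kubectl get nodes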

[5568] The calico-kube-controllers Pod fails to clean up resources

Fixed in 2.28.1 (17.2.5, 16.2.5, and 16.3.1)

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

  • The calico-kube-controllers Pod fails to clean up resources associated with the deleted node

  • The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>

[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update

During update of a Container Cloud cluster of any type, if the MKE minor version is updated from 3.4.x to 3.5.x, access to the cluster using the existing kubeconfig fails with the You must be logged in to the server (Unauthorized) error due to OIDC settings being reconfigured.

As a workaround, during the cluster update process, use the admin kubeconfig instead of the existing one. Once the update completes, you can use the existing cluster kubeconfig again.

To obtain the admin kubeconfig:

kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml

If the related cluster is regional, replace <pathToMgmtKubeconfig> with <pathToRegionalKubeconfig>.
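
To verify that the obtained admin kubeconfig works, run any read-only command against the affected cluster, for example:

kubectl --kubeconfig clusterKubeconfig.yaml get nodes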


TLS configuration

[29604] The ‘failed to get kubeconfig’ error during TLS configuration

Fixed in 14.0.0 and 15.0.1

When setting a new Transport Layer Security (TLS) certificate for a cluster, the false positive failed to get kubeconfig error may occur during the Waiting for TLS settings to be applied stage. No actions are required. Therefore, disregard the error.

To verify the status of the TLS configuration being applied:

kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.<Application>}"

Possible values for the <Application> parameter are as follows:

  • keycloak

  • ui

  • cache

  • mke

  • iamProxyAlerta

  • iamProxyAlertManager

  • iamProxyGrafana

  • iamProxyKibana

  • iamProxyPrometheus
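
For example, to verify the TLS settings applied for Keycloak, substitute keycloak for <Application>:

kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.keycloak}"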

Example of system response:

[
    {
        "expirationTime": "2024-01-06T09:37:04Z",
        "hostname": "domain.com"
    }
]

In this example, expirationTime equals the NotAfter field of the server certificate, and the value of hostname contains the host name configured for the application.


StackLight

[30040] OpenSearch is not in the ‘deployed’ status during cluster update

Fixed in 11.7.0 and 12.7.0

Note

The issue may affect the Container Cloud or Cluster release update to the following versions:

  • 2.22.0 for management and regional clusters

  • 11.6.0 for management, regional, and managed clusters

  • 13.2.5, 13.3.5, 13.4.3, and 13.5.2 for attached MKE clusters

The issue does not affect clusters originally deployed since the following Cluster releases: 11.0.0, 8.6.0, 7.6.0.

During cluster update to versions mentioned in the note above, the following OpenSearch-related error may occur on clusters that were originally deployed or attached using Container Cloud 2.15.0 or earlier, before the transition from Elasticsearch to OpenSearch:

The stacklight/opensearch release of the stacklight/stacklight-bundle HelmBundle
reconciled by the stacklight/stacklight-helm-controller Controller
is not in the "deployed" status for the last 15 minutes.

The issue affects clusters with elasticsearch.persistentVolumeClaimSize configured for values other than 30Gi.
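
To check the configured value, you can inspect the StackLight values in the Cluster object. The following sketch assumes that StackLight is configured through the stacklight entry in spec.providerSpec.value.helmReleases:

kubectl get cluster <ClusterName> -n <ClusterProjectName> -o json | \
jq '.spec.providerSpec.value.helmReleases[] | select(.name == "stacklight") | .values.elasticsearch.persistentVolumeClaimSize'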

To verify that the cluster is affected:

  1. Verify whether the HelmBundleReleaseNotDeployed alert for the opensearch release is firing. If so, the cluster is most probably affected. Otherwise, the cluster is not affected.

  2. Verify the reason of the HelmBundleReleaseNotDeployed alert for the opensearch release:

    kubectl get helmbundle stacklight-bundle -n stacklight -o json | jq '.status.releaseStatuses[] | select(.chart == "opensearch") | .message'
    

    Example system response from the affected cluster:

    Upgrade "opensearch" failed: cannot patch "opensearch-master" with kind StatefulSet: \
    StatefulSet.apps "opensearch-master" is invalid: spec: Forbidden: \
    updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
    

Workaround:

  1. Scale down the opensearch-dashboards and metricbeat resources to 0:

    kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards && \
    kubectl -n stacklight get pods -l app=opensearch-dashboards | awk '{if (NR!=1) {print $1}}' | xargs -r \
    kubectl -n stacklight wait --for=delete --timeout=10m pod
    
    kubectl -n stacklight scale --replicas 0 deployment metricbeat && \
    kubectl -n stacklight get pods -l app=metricbeat | awk '{if (NR!=1) {print $1}}' | xargs -r \
    kubectl -n stacklight wait --for=delete --timeout=10m pod
    

    Wait for the commands in this and the next step to complete. The completion time depends on the cluster size.

  2. Disable the elasticsearch-curator CronJob:

    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": true}}'
    
  3. Scale down the opensearch-master StatefulSet:

    kubectl -n stacklight scale --replicas 0 statefulset opensearch-master && \
    kubectl -n stacklight get pods -l app=opensearch-master | awk '{if (NR!=1) {print $1}}' | xargs -r \
    kubectl -n stacklight wait --for=delete --timeout=30m pod
    
  4. Delete the OpenSearch Helm release:

    helm uninstall --no-hooks opensearch -n stacklight
    
  5. Wait up to 5 minutes for Helm Controller to retry the upgrade and properly create the opensearch-master StatefulSet.

    To verify readiness of the opensearch-master Pods:

    kubectl -n stacklight wait --for=condition=Ready --timeout=30m pod -l app=opensearch-master
    

    Example of a successful system response in an HA setup:

    pod/opensearch-master-0 condition met
    pod/opensearch-master-1 condition met
    pod/opensearch-master-2 condition met
    

    Example of a successful system response in a non-HA setup:

    pod/opensearch-master-0 condition met
    
  6. Scale up the opensearch-dashboards and metricbeat resources:

    kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards && \
    kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=opensearch-dashboards
    
    kubectl -n stacklight scale --replicas 1 deployment metricbeat && \
    kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=metricbeat
    
  7. Enable the elasticsearch-curator CronJob:

    kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": false}}'
    

[29329] Recreation of the Patroni container replica is stuck

Fixed in 11.7.0 and 12.7.0

During an update of a Container Cloud cluster of any type, recreation of the Patroni container replica is stuck in the degraded state due to the liveness probe killing the container that runs the pg_rewind procedure. The issue affects clusters on which the pg_rewind procedure takes more time than the full cycle of the liveness probe.

Sample logs from an affected cluster:

INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO:  stdout=
INFO:  stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0     LOG:  database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0  LOG:  could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0  LOG:  invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0  PANIC:  could not locate a valid checkpoint record

Workaround:

For the affected replica and PVC, run:

kubectl delete persistentvolumeclaim/storage-volume-patroni-<replica-id> -n stacklight

kubectl delete pod/patroni-<replica-id> -n stacklight
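
The replica Pod and its persistent volume claim should be recreated automatically. To wait for the recreated Pod to become Ready, for example:

kubectl -n stacklight wait --for=condition=Ready --timeout=30m pod patroni-<replica-id>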

[28822] Reference Application triggers alerts during its upgrade

Fixed in 11.7.0 and 12.7.0

On managed clusters with Reference Application enabled, the following alerts are triggered during a managed cluster update from the Cluster release 11.5.0 to 11.6.0 or 7.11.0 to 11.5.0:

  • KubeDeploymentOutage for the refapp Deployment

  • RefAppDown

  • RefAppProbeTooLong

  • RefAppTargetDown

This behavior is expected and no actions are required. Therefore, disregard these alerts.

[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM

Fixed in 11.7.0 and 12.7.0

On baremetal-based management clusters, the restart count of the metric-collector Pod increases over time with reason: OOMKilled in the containerStatuses of the Pod. Only clusters with HTTP proxy enabled are affected.

Such behavior is expected. Therefore, disregard these restarts.
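
To inspect the restart count and the last termination reason, a possible check that assumes the metric-collector Pod runs in the stacklight namespace with the app=metric-collector label:

kubectl -n stacklight get pod -l app=metric-collector -o json | \
jq '.items[].status.containerStatuses[] | {name, restartCount, lastState}'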

[28373] Alerta can get stuck after a failed initialization

Fixed in 11.7.0 and 12.7.0

During creation of a Container Cloud cluster of any type with StackLight enabled, Alerta can get stuck after a failed initialization with only 1 Pod in the READY state. For example:

kubectl get po -n stacklight -l app=alerta

NAME                          READY   STATUS    RESTARTS   AGE
pod/alerta-5f96b775db-45qsz   1/1     Running   0          20h
pod/alerta-5f96b775db-xj4rl   0/1     Running   0          20h

Workaround:

  1. Recreate the affected Alerta Pod:

    kubectl --kubeconfig <affectedClusterKubeconfig> -n stacklight delete pod <stuckAlertaPodName>
    
  2. Verify that both Alerta Pods are in the READY state:

    kubectl get po -n stacklight -l app=alerta
    

[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.

On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.

The issue does not affect the StackLight services: all required StackLight pods migrate successfully, except for the extra pods that are created and get stuck during the migration.

As a workaround, remove the stuck pods:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
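
To list the stuck pods, you can filter for the failed pods in the stacklight namespace, since pods rejected by the NodeAffinity predicate typically end up in the Failed phase:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --field-selector status.phase=Failed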

Ceph

[26441] Cluster update fails with the MountDevice failed for volume warning

Update of a managed cluster that is based on bare metal and has Ceph enabled fails with the PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

  1. Verify that the description of the Pods that failed to run contains the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    

    In the command above, replace the following values:

    • <affectedProjectName> is the Container Cloud project name where the Pods failed to run

    • <affectedPodName> is a Pod name that failed to run in the specified project

    In the Pod description, identify the node name where the Pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the StatefulSet or Deployment of the affected Pod to 0 replicas (see the example after this procedure).

  4. On every csi-rbdplugin Pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected Pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
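
For example, if the affected workload is the prometheus-server StatefulSet mentioned above, and assuming it runs in the stacklight namespace, the scale down in step 3 and the scale up in step 7 could look as follows:

# Step 3: scale the affected StatefulSet down to 0 replicas
kubectl -n stacklight scale statefulset prometheus-server --replicas 0

# Step 7: scale it back up to the original number of replicas
kubectl -n stacklight scale statefulset prometheus-server --replicas <originalReplicaCount>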