Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.18.0, including the Cluster releases 11.2.0 and 7.8.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines known issues from previous Container Cloud releases that are still valid.


MKE

[20651] A cluster deployment or update fails with not ready compose deployments

A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
 got 0/0 replicas'
 ready: false
 type: Kubernetes

Workaround:

  1. Disable Docker Content Trust (DCT):

    1. Access the MKE web UI as admin.

    2. Navigate to Admin > Admin Settings.

    3. In the left navigation pane, click Docker Content Trust and disable it.

  2. Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:

    kubectl -n kube-system delete deployment <deploymentName>
    

    Once done, the cluster deployment or update resumes.

  3. Re-enable DCT.
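Step 2 of the workaround can be scripted when several deployments are affected. The following is a minimal sketch rather than an official procedure: the helper name restart_deployments is hypothetical, and the deployment names to pass depend on what is affected on the particular cluster.

```shell
# Hypothetical helper: delete each named deployment in kube-system,
# running the same command as step 2 once per deployment.
restart_deployments() (
  for name in "$@"; do
    kubectl -n kube-system delete deployment "$name"
  done
)

# Example usage (run only while DCT is disabled):
# restart_deployments calico-kube-controllers compose compose-api coredns
```

The helper only repeats the documented command; disabling and re-enabling DCT still has to be done in the MKE web UI.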



Bare metal

[24806] The dnsmasq parameters are not applied on multi-rack clusters

During bootstrap of a bare metal management cluster with a multi-rack topology, the dhcp-option=tag parameters are not applied to dnsmasq.conf.

Symptoms:

The logs of the dnsmasq-controller container include an error message similar to the following example:

KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas logs --tail 50 deployment/dnsmasq -c dnsmasq-controller

...
I0622 09:05:26.898898       8 handler.go:19] Failed to watch Object, kind:'dnsmasq': failed to list *unstructured.Unstructured: the server could not find the requested resource
E0622 09:05:26.899108       8 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.8/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: the server could not find the requested resource
...
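To check for this symptom without reading the log by eye, the log output can be filtered. This is a minimal sketch: the helper name has_dnsmasq_watch_error is hypothetical, and the match pattern is taken from the example log lines above, so it may need adjusting if the message wording differs.

```shell
# Hypothetical helper: read dnsmasq-controller log lines on stdin and
# succeed if they contain the "Failed to watch" error shown above.
has_dnsmasq_watch_error() {
  grep -q "Failed to watch .*the server could not find the requested resource"
}

# Example usage against a management cluster:
# KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas logs --tail 50 \
#   deployment/dnsmasq -c dnsmasq-controller | has_dnsmasq_watch_error \
#   && echo "symptom of issue 24806 found"
```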

Workaround:

Manually update deployment/dnsmasq with the updated image:

KUBECONFIG=kaas-mgmt-kubeconfig kubectl -n kaas set image deployment/dnsmasq dnsmasq-controller=mirantis.azurecr.io/bm/dnsmasq-controller:base-focal-2-18-issue24806-20220618085127
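After the image is updated, it can be useful to confirm which image the deployment now references. The following is a minimal sketch, assuming the container is named dnsmasq-controller as in the command above; the helper name dnsmasq_controller_image is hypothetical.

```shell
# Hypothetical helper: print the image currently set for the
# dnsmasq-controller container of deployment/dnsmasq in the kaas namespace.
dnsmasq_controller_image() {
  kubectl -n kaas get deployment/dnsmasq \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="dnsmasq-controller")].image}'
}

# Example usage:
# export KUBECONFIG=kaas-mgmt-kubeconfig
# dnsmasq_controller_image
```

The printed value should match the image tag passed to kubectl set image.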

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

  • All Pods are stuck in the Terminating state

  • A new ironic Pod fails to start

  • The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
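The two commands of the workaround can be chained so that the drain runs only if the cordon succeeds. This is a minimal sketch; the helper name cordon_and_drain is hypothetical. Depending on the workloads on the node, kubectl drain may need extra flags such as --ignore-daemonsets. The workaround above does not specify any, so none are added here.

```shell
# Hypothetical helper: cordon the node, then drain it, stopping at the
# first command that fails.
cordon_and_drain() {
  kubectl cordon "$1" && kubectl drain "$1"
}

# Example usage, before deleting the node that runs the ironic Pod:
# cordon_and_drain <nodeName>
```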

[20736] Region deletion failure after regional deployment failure

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

  • cluster

  • machine

  • baremetalhost

  • baremetalhostprofile

  • l2template

  • subnet

  • ipamhost

  • ipaddr

kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
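The deletion command from the workaround can be looped over the listed object kinds. The following is a minimal sketch: the helper name delete_region_objects is hypothetical, the deletion order is arbitrary, and the loop covers only the kinds listed above, so other leftovers may still need manual cleanup. Like the documented command, it does not set a namespace and relies on the current kubectl context.

```shell
# Hypothetical helper: delete every listed object kind that carries the
# kaas.mirantis.com/region label of the affected region.
delete_region_objects() (
  region="$1"
  for kind in cluster machine baremetalhost baremetalhostprofile \
              l2template subnet ipamhost ipaddr; do
    kubectl delete "$kind" -l "kaas.mirantis.com/region=$region"
  done
)

# Example usage:
# delete_region_objects <regionName>
```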



Equinix Metal

[16379,23865] Cluster update fails with the FailedMount warning

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

  1. Verify that the description of the pods that failed to run contains the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    
    • <affectedProjectName> is the Container Cloud project name where the pods failed to run

    • <affectedPodName> is a pod name that failed to run in this project

    In the pod description, identify the node name where the pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the affected StatefulSet or Deployment of the pod that fails to initialize to 0 replicas.

  4. On every csi-rbdplugin pod, search for stuck csi-vol:

    for pod in $(kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'); do
      echo "$pod"
      kubectl exec -it -n rook-ceph "$pod" -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
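Steps 3 and 7 require scaling the workload to 0 and later back to its original size. The following is a minimal sketch of how the original replica count could be recorded and restored; the helper names get_replicas and scale_to are hypothetical, and the workload is passed as kind/name.

```shell
# Hypothetical helpers for steps 3 and 7. Arguments: the namespace, then
# the workload as kind/name, for example statefulset/my-app.

# Print the current replica count so that it can be restored later.
get_replicas() {
  kubectl -n "$1" get "$2" -o jsonpath='{.spec.replicas}'
}

# Scale the workload to the given number of replicas.
scale_to() {
  kubectl -n "$1" scale "$2" --replicas="$3"
}

# Example usage:
# original="$(get_replicas <affectedProjectName> statefulset/<name>)"
# scale_to <affectedProjectName> statefulset/<name> 0
# ... perform steps 4 to 6 ...
# scale_to <affectedProjectName> statefulset/<name> "$original"
```

Recording the count before scaling down avoids having to look up the original number of replicas afterwards.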



Upgrade

[24802] Container Cloud upgrade to 2.18.0 can trigger managed clusters update

On clusters with proxy enabled and the NO_PROXY settings containing localhost/127.0.0.1 or matching the automatically added Container Cloud internal endpoints, the Container Cloud release upgrade from 2.17.0 to 2.18.0 triggers an automatic update of managed clusters to the latest available Cluster releases in their respective series.

For the issue workaround, contact Mirantis support.

[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck

Affects Ubuntu-based clusters deployed after Feb 10, 2022

If you deploy an Ubuntu-based cluster using the deprecated Cluster release 7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022, the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck while applying the Deploy state to the cluster machines. The issue affects all cluster types: management, regional, and managed.

To verify that the cluster is affected:

  1. Log in to the Container Cloud web UI.

  2. In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.

  3. Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:

    containerd --version
    

    If the version is 1.5.9, the cluster is affected.

  4. In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:

    The following packages will be upgraded:
      docker-ee docker-ee-cli
    The following packages will be DOWNGRADED:
      containerd.io
    
    STDERR:
    E: Packages were downgraded and -y was used without --allow-downgrades.
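The checks from steps 3 and 4 can be scripted on the node. This is a minimal sketch with hypothetical helper names. The version check assumes that containerd --version prints the version as a whitespace-separated field, optionally prefixed with v; the log check searches for the error text quoted above.

```shell
# Hypothetical helper: read `containerd --version` output on stdin and
# succeed if it reports version 1.5.9.
is_containerd_1_5_9() {
  grep -qE '(^|[[:space:]])v?1\.5\.9([[:space:]]|$)'
}

# Hypothetical helper: succeed if any file under the given LCM deploy log
# directory contains the apt downgrade error.
has_downgrade_error() {
  grep -rqF -e "Packages were downgraded and -y was used without --allow-downgrades" "$1"
}

# Example usage on the stuck node:
# containerd --version | is_containerd_1_5_9 \
#   && has_downgrade_error /var/log/lcm/runners/<nodeName>/deploy/ \
#   && echo "node is affected"
```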
    

Workaround:

Warning

Apply the steps below to the affected nodes one by one, and only after each consecutive node gets stuck on the Deploy phase with the Ansible log errors. This sequence ensures that each node is cordoned and drained and that Docker is properly stopped, so no workloads are affected.

  1. Using SSH, log in to the first affected node and install containerd 1.5.8:

    apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
    
  2. Wait for Ansible to reconcile. The node should become Ready within several minutes.

  3. Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.

  4. Patch the remaining nodes one-by-one using the steps above.
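After running the apt-get command on a node, the installed package version can be confirmed before moving on to the next node. The following is a minimal sketch; the helper name containerd_io_version is hypothetical, and the expected value 1.5.8-1 is the version string used in the install command above.

```shell
# Hypothetical helper: print the installed version of the containerd.io
# package as recorded by dpkg.
containerd_io_version() {
  dpkg-query -W -f='${Version}' containerd.io
}

# Example usage on a patched node:
# [ "$(containerd_io_version)" = "1.5.8-1" ] && echo "downgrade applied"
```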


Container Cloud web UI

[23002] Inability to set a custom value for a predefined node label

During machine creation using the Container Cloud web UI, a custom value for a node label cannot be set.

As a workaround, manually add the value to spec.providerSpec.value.nodeLabels in machine.yaml.


[249] A newly created project does not display in the Container Cloud web UI

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.