Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.15.0 including the Cluster releases 7.5.0 and 5.22.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.


MKE

[20651] A cluster deployment or update fails with not ready compose deployments

A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
 got 0/0 replicas'
 ready: false
 type: Kubernetes

Workaround:

  1. Disable Docker Content Trust (DCT):

    1. Access the MKE web UI as admin.

    2. Navigate to Admin > Admin Settings.

    3. In the left navigation pane, click Docker Content Trust and disable it.

  2. Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:

    kubectl -n kube-system delete deployment <deploymentName>
    

    Once done, the cluster deployment or update resumes.

  3. Re-enable DCT.
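Step 2 of the workaround can be applied to several deployments in one pass. The following is a minimal sketch, assuming that you already know which kube-system deployments are affected. The restart_affected_deployments function name is hypothetical and is not part of Container Cloud or MKE.

```shell
# Hypothetical helper: run the delete command from step 2 for each
# kube-system deployment name passed as an argument.
# Stops at the first failure.
restart_affected_deployments() {
  for name in "$@"; do
    kubectl -n kube-system delete deployment "$name" || return 1
  done
}
```

For example, restart_affected_deployments compose compose-api runs the command from step 2 once for each of the two deployments.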



Equinix Metal

[20467] Failure to deploy an Equinix Metal based management cluster

Fixed in 2.16.0

Deployment of an Equinix Metal based management cluster with private networking may fail during the Ironic deployment because the csi-rbdplugin provisioner pods get stuck. The failure is accompanied by the following error message:

0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.

The workaround is to restart the csi-rbdplugin provisioner pods:

kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner


Bare metal

[20745] Namespace deletion failure after managed cluster removal

Fixed in 2.16.0

After removal of a managed cluster, the namespace is not deleted due to KaaSCephOperationRequest CRs blocking the deletion. The workaround is to manually remove finalizers and delete the KaaSCephOperationRequest CRs.

Workaround:

  1. Remove finalizers from all KaaSCephOperationRequest resources:

    kubectl -n <managed-ns> get kaascephoperationrequest -o name | xargs -I % kubectl -n <managed-ns> patch % -p '{"metadata":{"finalizers":[]}}' --type=merge
    
  2. Delete all KaaSCephOperationRequest resources:

    kubectl -n <managed-ns> delete kaascephoperationrequest --all
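To confirm that the workaround took effect, you can check that no KaaSCephOperationRequest resources remain in the namespace. The following is a minimal sketch; the no_ceph_requests_left function name is hypothetical.

```shell
# Hypothetical helper: succeed only when no KaaSCephOperationRequest
# resources are left in the namespace given as $1.
no_ceph_requests_left() {
  remaining=$(kubectl -n "$1" get kaascephoperationrequest -o name)
  [ -z "$remaining" ]
}
```

For example, no_ceph_requests_left <managed-ns> && echo "no requests left" prints the message only when the list of resources is empty.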
    

[17792] Full preflight fails with a timeout waiting for BareMetalHost

If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:

preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition

Workaround:

  1. Disable the full preflight by unsetting the environment variable using the unset KAAS_BM_FULL_PREFLIGHT command.

  2. Rerun bootstrap.sh preflight, which now executes the fast preflight instead.
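The two steps above can be combined in a small wrapper. This is a sketch only: the run_fast_preflight function and the BOOTSTRAP_SCRIPT variable are hypothetical names introduced for illustration, and the script is assumed to be located in the current directory unless BOOTSTRAP_SCRIPT says otherwise.

```shell
# Hypothetical wrapper: run the preflight check with the full preflight
# disabled. BOOTSTRAP_SCRIPT is an assumed variable that points to the
# bootstrap script and defaults to ./bootstrap.sh.
run_fast_preflight() {
  unset KAAS_BM_FULL_PREFLIGHT
  "${BOOTSTRAP_SCRIPT:-./bootstrap.sh}" preflight
}
```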


IAM

[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page

Fixed in 2.18.0

During configuration of a SAML identity provider using the Add identity provider menu of the Keycloak admin console, the page style breaks and the Save and Cancel buttons disappear.

Workaround:

  1. Log in to the Keycloak admin console.

  2. In the sidebar menu, switch to the Master realm.

  3. Navigate to Realm Settings > Themes.

  4. In the Admin Console Theme drop-down menu, select keycloak.

  5. Click Save and refresh the browser window to apply the changes.


LCM

[22341] The cordon-drain states are not removed after maintenance mode is unset

Fixed in 2.17.0

The cordon-drain states are not removed after the maintenance mode is unset for a machine. This issue may occur due to the maintenance transition being stuck on the NodeWorkloadLock object.

Workaround:

Select from the following options:

  • Disable the maintenance mode on the affected cluster as described in Enable cluster and machine maintenance mode.

  • Edit the LCMClusterState object by setting value to "false" in the spec section:

    kubectl edit lcmclusterstates -n <projectName> <LCMClusterStateName>
    
    apiVersion: lcm.mirantis.com/v1alpha1
    kind: LCMClusterState
    metadata:
      ...
    spec:
      ...
      value: "false"
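If you prefer a non-interactive change over kubectl edit, the same field can be set with a merge patch. The following is a sketch under the assumption that a merge patch of spec.value has the same effect as the manual edit described above. The unset_lcm_cluster_state function name is hypothetical.

```shell
# Hypothetical helper: set spec.value of an LCMClusterState object to "false"
# with a merge patch instead of an interactive edit.
# $1 is the project (namespace) name, $2 is the LCMClusterState object name.
unset_lcm_cluster_state() {
  kubectl -n "$1" patch lcmclusterstates "$2" --type=merge \
    -p '{"spec":{"value":"false"}}'
}
```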
    

Monitoring

[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.

On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.

The issue does not affect the StackLight services. All required StackLight pods migrate successfully. Only the extra pods that are created during pod migration get stuck.

As a workaround, remove the stuck pods:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
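To find the stuck pod names, you can filter the pod list by status. The following is a minimal sketch, assuming that the stuck pods show NodeAffinity in the STATUS column of the kubectl get pods output. The filter_node_affinity_pods function name is hypothetical.

```shell
# Hypothetical filter: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose STATUS column (third field) is
# NodeAffinity.
filter_node_affinity_pods() {
  awk '$3 == "NodeAffinity" { print $1 }'
}
```

For example, pipe the output of kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight get pods --no-headers into filter_node_affinity_pods and review the resulting names before deleting any pod.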

[21646] The kaas-exporter container is periodically throttled and OOMKilled

Fixed in 2.16.0

On highly loaded clusters, the kaas-exporter resource limits for CPU and RAM are lower than the amount of resources consumed. As a result, the kaas-exporter container is periodically throttled and OOMKilled, which prevents Container Cloud metrics from being gathered.

As a workaround, increase the default resource limits for kaas-exporter in the Cluster object of the management cluster. For example:

spec:
  ...
  providerSpec:
    ...
    value:
      ...
      kaas:
        management:
          helmReleases:
          ...
          - name: kaas-exporter
            values:
              resources:
                limits:
                  cpu: 100m
                  memory: 200Mi


Upgrade

[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck

Affects Ubuntu-based clusters deployed after Feb 10, 2022

If you deploy an Ubuntu-based cluster using the deprecated Cluster release 7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022, the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck while applying the Deploy state to the cluster machines. The issue affects all cluster types: management, regional, and managed.

To verify that the cluster is affected:

  1. Log in to the Container Cloud web UI.

  2. In the Clusters tab, capture the RELEASE and AGE values of the required Ubuntu-based cluster. If the values match the ones from the issue description, the cluster may be affected.

  3. Using SSH, log in to the manager or worker node that got stuck while applying the Deploy state and identify the containerd package version:

    containerd --version
    

    If the version is 1.5.9, the cluster is affected.

  4. In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible deployment logs contain the following errors that indicate that the cluster is affected:

    The following packages will be upgraded:
      docker-ee docker-ee-cli
    The following packages will be DOWNGRADED:
      containerd.io
    
    STDERR:
    E: Packages were downgraded and -y was used without --allow-downgrades.
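The checks from steps 3 and 4 can be scripted on the affected node. The following is a sketch only. Both function names are hypothetical, and the version check assumes that the containerd --version output contains the version as a separate word, optionally prefixed with v.

```shell
# Hypothetical checks for the two indicators described in steps 3 and 4.

# Succeeds when the `containerd --version` output passed as $1 reports 1.5.9.
is_affected_containerd() {
  printf '%s\n' "$1" | grep -Eq '(^| )v?1\.5\.9( |$)'
}

# Succeeds when any file under the log directory passed as $1 contains the
# apt downgrade error.
has_downgrade_error() {
  grep -rqF 'Packages were downgraded and -y was used without --allow-downgrades' "$1"
}
```

For example, is_affected_containerd "$(containerd --version)" && has_downgrade_error /var/log/lcm/runners/<nodeName>/deploy/ succeeds only when both indicators are present.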
    

Workaround:

Warning

Apply the steps below to the affected nodes one by one, and only after each consecutive node gets stuck on the Deploy phase with the Ansible log errors. This sequence ensures that each node is cordon-drained and Docker is properly stopped, so no workloads are affected.

  1. Using SSH, log in to the first affected node and install containerd 1.5.8:

    apt-get install containerd.io=1.5.8-1 -y --allow-downgrades --allow-change-held-packages
    
  2. Wait for Ansible to reconcile. The node should become Ready in several minutes.

  3. Wait for the next node of the cluster to get stuck on the Deploy phase with the Ansible log errors. Only after that, apply the steps above on the next node.

  4. Patch the remaining nodes one-by-one using the steps above.

[20189] Container Cloud web UI reports upgrade while running previous release

Fixed in 2.16.0

Under certain conditions, the upgrade of the baremetal-based management cluster may get stuck even though the Container Cloud web UI reports a successful upgrade. The issue is caused by inconsistent metadata in IPAM that prevents automatic allocation of the Ceph network. It happens when IPAddr objects associated with the management cluster nodes refer to a non-existent Subnet object by the resource UID.

To verify whether the cluster is affected:

  1. Inspect the baremetal-provider logs:

    kubectl -n kaas logs deployments/baremetal-provider
    

    If the logs contain the following entries, the cluster may be affected:

    Ceph public network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
    
    Ceph cluster network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
    
    'default/kaas-mgmt' cluster nodes internal (LCM) IP addresses: 10.64.96.171,10.64.96.172,10.64.96.173 \
    
    failed to configure ceph network for cluster default/kaas-mgmt: \
    Ceph network addresses auto-assignment error: validation failed for Ceph network addresses: \
    error parsing address '': invalid CIDR address:
    

    Empty values of the network parameters in the last entry indicate that the provider cannot locate the Subnet object based on the data from the IPAddr object.

    Note

    In the logs, capture the internal (LCM) IP addresses of the cluster nodes to use them later in this procedure.

  2. Validate the network address used for Ceph by inspecting the MiraCeph object:

    kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +clusterNet:"
    kubectl -n ceph-lcm-mirantis get miraceph -o yaml | egrep "^ +publicNet:"
    

    In the system response, verify that the clusterNet and publicNet values do not contain the 0.0.0.0/0 range.

    Example of the system response on the affected cluster:

    clusterNet: 0.0.0.0/0
    
    publicNet: 0.0.0.0/0
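The check from step 2 can be expressed as a single filter over the MiraCeph YAML. The following is a minimal sketch, assuming that the values appear unquoted as in the example response above. The ceph_networks_unset function name is hypothetical.

```shell
# Hypothetical check: read MiraCeph YAML on stdin and succeed when clusterNet
# or publicNet is set to the 0.0.0.0/0 range.
ceph_networks_unset() {
  grep -Eq '^ +(clusterNet|publicNet): *0\.0\.0\.0/0 *$'
}
```

For example, kubectl -n ceph-lcm-mirantis get miraceph -o yaml | ceph_networks_unset && echo "cluster may be affected" prints the message only when one of the two parameters contains the 0.0.0.0/0 range.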
    

Workaround:

  1. Add a label to the Subnet object:

    Note

    To obtain the correct name of the label, use one of the cluster nodes internal (LCM) IP addresses from the baremetal-provider logs.

    1. Store the ipam/SubnetID label value of the IPAddr object in the SUBNETID environment variable. For example:

      SUBNETID=$(kubectl get ipaddr -n default --selector=ipam/IP=10.64.96.171 -o custom-columns=":metadata.labels.ipam/SubnetID" | tr -d '\n')
      
    2. Use the SUBNETID variable to restore the required label in the Subnet object:

      kubectl -n default label subnet master-region-one ipam/UID-${SUBNETID}="1"
      
  2. Verify that the cluster.sigs.k8s.io/cluster-name label exists for IPAddr objects:

    kubectl -n default get ipaddr --show-labels|grep "cluster.sigs.k8s.io/cluster-name"
    

    Skip the next step if all IPAddr objects corresponding to the management cluster nodes have this label.

  3. Add the cluster.sigs.k8s.io/cluster-name label to IPAddr objects:

    IPADDRNAMES=$(kubectl -n default get ipaddr -o custom-columns=":metadata.name")
    for IP in $IPADDRNAMES; do kubectl -n default label ipaddr $IP cluster.sigs.k8s.io/cluster-name=<managementClusterName>; done
    

    In the command above, substitute <managementClusterName> with the corresponding value.
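As a variant of step 3, the loop can be limited to the IPAddr objects that do not have the label yet by using a label selector that tests for a missing key. The following is a sketch only; the label_unlabeled_ipaddrs function name is hypothetical.

```shell
# Hypothetical variant of step 3: label only the IPAddr objects in the
# default namespace that do not have the cluster.sigs.k8s.io/cluster-name
# label yet. $1 is the management cluster name.
label_unlabeled_ipaddrs() {
  for obj in $(kubectl -n default get ipaddr \
      -l '!cluster.sigs.k8s.io/cluster-name' -o name); do
    kubectl -n default label "$obj" "cluster.sigs.k8s.io/cluster-name=$1" || return 1
  done
}
```

For example, label_unlabeled_ipaddrs <managementClusterName> leaves the objects that are already labeled untouched.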


[16379,23865] Cluster update fails with the FailedMount warning

Fixed in 2.19.0

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

  1. Verify that the description of the pods that failed to run contains the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    
    • <affectedProjectName> is the Container Cloud project name where the pods failed to run

    • <affectedPodName> is a pod name that failed to run in this project

    In the pod description, identify the node name where the pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.

  4. On every csi-rbdplugin pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
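Steps 3 and 7 do not list the exact commands. The following is a minimal sketch of how the scaling could be done with kubectl while keeping track of the original number of replicas. The get_replicas and scale_to function names are hypothetical.

```shell
# Hypothetical helpers for steps 3 and 7.
# $1 is the namespace, $2 is the resource in kind/name form,
# for example statefulset/<name> or deployment/<name>.

# Print the current number of replicas so that it can be restored later.
get_replicas() {
  kubectl -n "$1" get "$2" -o jsonpath='{.spec.replicas}'
}

# Scale the resource to the number of replicas given in $3.
scale_to() {
  kubectl -n "$1" scale "$2" --replicas="$3"
}
```

For example, save the output of get_replicas to a variable before running scale_to with 0 in step 3, then pass the saved value to scale_to in step 7.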



Container Cloud web UI

[249] A newly created project does not display in the Container Cloud web UI

Affects only Container Cloud 2.18.0 and earlier

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.