Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.8.0 including the Cluster releases 5.15.0 and 6.14.0.

Note

This section also outlines still valid known issues from previous Container Cloud releases.


AWS

[8013] Managed cluster deployment requiring PVs may fail

Fixed in the Cluster release 7.0.0

Note

The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

On a management cluster with multiple AWS-based managed clusters, some clusters fail to complete the deployments that require persistent volumes (PVs), for example, Elasticsearch. Some of the affected pods get stuck in the Pending state with the pod has unbound immediate PersistentVolumeClaims and node(s) had volume node affinity conflict errors.

Warning

The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.

Workaround:

  1. Obtain the persistent volume claims related to the storage mounts of the affected pods:

    kubectl get pod/<pod_name1> pod/<pod_name2> \
    -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
    

    Note

    In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.

  2. Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:

    kubectl -n stacklight delete \
      pod/<pod_name1> pod/<pod_name2> ... \
      pvc/<pvc_name1> pvc/<pvc_name2> ...
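
The argument list for the delete command in step 2 can be assembled from the names collected in step 1. The following is a minimal sketch only: to_resource_args is a hypothetical helper name, not part of kubectl or Container Cloud, and the stacklight namespace is taken from the example above.

```shell
# Hypothetical helper: prefix every given name with "<kind>/" so that the
# result can be passed to "kubectl delete" as resource arguments.
to_resource_args() {
    kind=$1
    shift
    out=""
    for name in "$@"; do
        # Separate entries with a single space, without a leading space
        out="${out:+$out }${kind}/${name}"
    done
    printf '%s\n' "$out"
}

# Usage (not executed here; the names are placeholders):
#   kubectl -n stacklight delete \
#     $(to_resource_args pod <pod_name1> <pod_name2>) \
#     $(to_resource_args pvc <pvc_name1> <pvc_name2>)
```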
    


vSphere

[15698] VIP is assigned to each manager node instead of a single node

Fixed in 2.11.0

A load balancer virtual IP address (VIP) is assigned to each manager node on any type of the vSphere-based cluster. The issue occurs because the Keepalived instances cannot set up a cluster due to the blocked vrrp protocol traffic in the firewall configuration on the Container Cloud nodes.

Note

Before applying the workaround below, verify that the dedicated vSphere network does not have any other virtual machines with the keepalived instance running with the same vrouter_id.

You can verify the vrouter_id value of the cluster in /etc/keepalived/keepalived.conf on the manager nodes.

Workaround

Update the firewalld configuration on each manager node of the affected cluster to allow the vrrp protocol traffic between the nodes:

  1. SSH to any manager node using mcc-user.

  2. Apply the firewalld configuration:

    firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
    firewall-cmd --reload
    
  3. Apply the procedure to the remaining manager nodes of the cluster.
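
To confirm that the rule is in place on a node, the list of rich rules can be inspected. The following is a sketch under the assumption that the rule appears in the firewall-cmd --list-rich-rules output in the same form as it was added; has_vrrp_rule is a hypothetical helper name.

```shell
# Hypothetical helper: return 0 if the rich-rule list read from stdin
# contains a rule that accepts the vrrp protocol.
has_vrrp_rule() {
    grep -q 'protocol value="vrrp" accept'
}

# Usage on a manager node (not executed here):
#   firewall-cmd --list-rich-rules | has_vrrp_rule && echo "vrrp traffic is allowed"
```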


[14080] Node leaves the cluster after IP address change

A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.

The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP address assignment to the vSphere virtual machines is done by a DHCP server present in the vSphere network.

By default, a DHCP server keeps the lease of an IP address for 30 minutes. Usually, dhclient on a VM prolongs this lease by sending DHCP requests to the server before the lease period ends. The DHCP prolongation request period is always less than the default lease time on the DHCP server, so prolongation usually works. But in case of network issues, for example, when dhclient on the VM cannot reach the DHCP server, or when the VM takes longer than the lease time to power on, the VM may lose its assigned IP address. As a result, it obtains a new IP address.

Container Cloud does not support network reconfiguration after the IP address of a VM has changed. Therefore, such an issue may lead to a VM leaving the cluster.

Symptoms:

  • One of the nodes is in the NodeNotReady or down state:

    kubectl get nodes -o wide
    docker node ls
    
  • The UCP Swarm manager logs on the healthy manager node contain the following example error:

    docker logs -f ucp-swarm-manager
    
    level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>: 12376"
    
  • If the affected node is manager:

    • The output of the docker info command contains the following example error:

      Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
      It's possible that too few managers are online. \
      Make sure more than half of the managers are online.
      
    • The UCP controller logs contain the following example error:

      docker logs -f ucp-controller
      
      "warning","msg":"Node State Active check error: \
      Swarm Mode Manager health check error: \
      info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
      Is the docker daemon running?
      
  • On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.

  • The following lines are present in /var/log/messages:

    dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
    

    If there are several lines where the IP is different, the node is affected.
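
The last symptom can be checked by counting the distinct addresses in the dhclient lines. The following is a sketch that assumes the log line format shown above; count_bound_ips is a hypothetical helper name.

```shell
# Hypothetical helper: read log lines from stdin and print the number of
# distinct IP addresses that appear in dhclient "bound to" messages.
count_bound_ips() {
    sed -n 's/.*dhclient\[[0-9]*]: bound to \([^ ]*\) .*/\1/p' | sort -u | wc -l | tr -d ' '
}

# Usage on the suspected node (not executed here):
#   count_bound_ips < /var/log/messages
# A result greater than 1 suggests that the node changed its IP address.
```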

Workaround:

Select from the following options:

  • Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.

  • Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address.

  • If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.

[14458] Failure to create a container for pod: cannot allocate memory

Fixed in 2.9.0 for new clusters

Newly created pods may fail to run and have the CrashLoopBackOff status on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware vSphere provider. The following is an example output of the kubectl describe pod <pod-name> -n <projectName> command:

State:        Waiting
Reason:       CrashLoopBackOff
Last State:   Terminated
Reason:       ContainerCannotRun
Message:      OCI runtime create failed: container_linux.go:349:
              starting container process caused "process_linux.go:297:
              applying cgroup configuration for process caused
              "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>:
              cannot allocate memory": unknown

The issue occurs due to the Kubernetes and Docker community issues.

According to the Red Hat solution, the workaround is to disable the kernel memory accounting feature by appending cgroup.memory=nokmem to the kernel command line.

Note

The workaround below applies to the existing clusters only. The issue is resolved for new Container Cloud 2.9.0 deployments since the workaround below automatically applies to the VM template built during the vSphere-based management cluster bootstrap.

Apply the following workaround on each machine of the affected cluster.

Workaround

  1. SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation to proceed as the root user.

  2. In /etc/default/grub, set cgroup.memory=nokmem for GRUB_CMDLINE_LINUX.

  3. Update the kernel:

    yum install kernel kernel-headers kernel-tools kernel-tools-libs kexec-tools
    
  4. Update the grub configuration:

    grub2-mkconfig -o /boot/grub2/grub.cfg
    
  5. Reboot the machine.

  6. Wait for the machine to become available.

  7. Wait for 5 minutes for Docker and Kubernetes services to start.

  8. Verify that the machine is Ready:

    docker node ls
    kubectl get nodes
    
  9. Repeat the steps above on the remaining machines of the affected cluster.
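
After the reboot, the active kernel command line can be checked for the new parameter. The following is a minimal sketch; has_nokmem is a hypothetical helper name, and /proc/cmdline is the standard Linux location of the running kernel command line.

```shell
# Hypothetical helper: return 0 if the kernel command line passed as the
# first argument contains the cgroup.memory=nokmem parameter as a whole word.
has_nokmem() {
    case " $1 " in
        *" cgroup.memory=nokmem "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a rebooted machine (not executed here):
#   has_nokmem "$(cat /proc/cmdline)" && echo "kernel memory accounting is disabled"
```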



OpenStack

[10424] Regional cluster cleanup fails by timeout

An OpenStack-based regional cluster cleanup fails with the timeout error.

Workaround:

  1. Wait for the Cluster object to be deleted in the bootstrap cluster:

    kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
    

    The system output must be empty.

  2. Remove the bootstrap cluster manually:

    ./bin/kind delete cluster --name clusterapi
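
The waiting in step 1 can be scripted as a polling loop. The following is a sketch only: wait_until_empty is a hypothetical helper name, the attempt count, delay, and temporary file path are arbitrary, and the loop assumes that the command prints nothing to stdout once no Cluster objects remain.

```shell
# Hypothetical helper: wait_until_empty <attempts> <delay_seconds> <command...>
# Re-runs the command until it succeeds with empty stdout or attempts run out.
wait_until_empty() {
    attempts=$1
    delay=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        # A failing command (for example, an unreachable API) is not treated as empty
        if out=$("$@" 2>/dev/null) && [ -z "$out" ]; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1
}

# Usage (not executed here). The kubeconfig is saved to a file first because
# the command is executed repeatedly:
#   ./bin/kind get kubeconfig --name clusterapi > /tmp/clusterapi.kubeconfig
#   wait_until_empty 60 10 kubectl --kubeconfig /tmp/clusterapi.kubeconfig get cluster \
#     && ./bin/kind delete cluster --name clusterapi
```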
    


Bare metal

[7655] Wrong status for an incorrectly configured L2 template

Fixed in 2.11.0

If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with the runtime errors in the IpamHost object.

Workaround:

If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
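
The two fields can be extracted from the YAML output of the object. The following is a sketch that assumes each field is printed on a single line in the form <field>: <value>; ipam_results is a hypothetical helper name, and the ipamhost resource name in the usage line is an assumption.

```shell
# Hypothetical helper: read an IpamHost YAML document from stdin and print
# only the l2RenderResult and ipAllocationResult lines.
ipam_results() {
    grep -E '^[[:space:]]*(l2RenderResult|ipAllocationResult):'
}

# Usage (not executed here; names in angle brackets are placeholders):
#   kubectl -n <projectName> get ipamhost <ipamHostName> -o yaml | ipam_results
```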



Storage

[14051] CephCluster creation fails if manageOsds is enabled before deploy

Fixed in 2.9.0

If manageOsds is enabled in the pre-deployment KaaSCephCluster template, the bare metal management or managed cluster fails to deploy due to the CephCluster creation failure.

As a workaround, disable manageOsds in the KaaSCephCluster template before the cluster deployment. You can enable this parameter after deployment as described in Operations Guide: Enable automated Ceph LCM.

[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement

Fixed in 2.11.0

If you use a custom BareMetalHostProfile, after disk replacement on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state due to the Ceph OSD authorization key failing to be created properly.

Workaround:

  1. Export kubeconfig of your managed cluster. For example:

    export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
    
  2. Log in to the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    
  3. Delete the authorization key for the failed Ceph OSD:

    ceph auth del osd.<ID>
    
  4. SSH to the node on which the Ceph OSD cannot be created.

  5. Clean up the disk that will be a base for the failed Ceph OSD. For details, see official Rook documentation.

    Note

    Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.

  6. On the managed cluster, restart the Rook operator:

    kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
    


IAM

[13385] MariaDB pods fail to start after SST sync

Fixed in 2.12.0

The MariaDB pods fail to start after MariaDB blocks itself during the State Snapshot Transfers sync.

Workaround:

  1. Verify the failed pod readiness:

    kubectl describe pod -n kaas <failedMariadbPodName>
    

    If the readiness probe failed with the WSREP not synced message, proceed to the next step. Otherwise, assess the MariaDB pod logs to identify the failure root cause.

  2. Obtain the MariaDB admin password:

    kubectl get secret -n kaas mariadb-dbadmin-password -o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d ; echo
    
  3. Verify that wsrep_local_state_comment is Donor or Desynced:

    kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -uroot -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
    
  4. Restart the failed pod:

    kubectl delete pod -n kaas <failedMariadbPodName>
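
The check in step 3 can be combined with the restart in step 4. The following is a sketch only; is_donor_or_desynced is a hypothetical helper name that inspects the output of the SHOW status query from step 3.

```shell
# Hypothetical helper: return 0 if the mysql output read from stdin reports
# wsrep_local_state_comment as Donor or Desynced.
is_donor_or_desynced() {
    grep -Eq 'wsrep_local_state_comment[[:space:]]+.*(Donor|Desynced)'
}

# Usage (not executed here). The -t option is omitted because the output is
# piped instead of being shown on a terminal:
#   kubectl exec -n kaas <failedMariadbPodName> -- \
#     mysql -uroot -p<mariadbAdminPassword> -e "SHOW status LIKE 'wsrep_local_state_comment';" \
#     | is_donor_or_desynced && kubectl delete pod -n kaas <failedMariadbPodName>
```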
    


LCM

[13402] Cluster fails with error: no space left on device

Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters

If an application running on a Container Cloud management or managed cluster fails frequently, for example, PostgreSQL, it may produce an excessive amount of core dumps. This leads to the no space left on device error on the cluster nodes and, as a result, to the broken Docker Swarm and the entire cluster.

Core dumps are disabled by default on the operating system of the Container Cloud nodes. But since Docker does not inherit the operating system settings, disable core dumps in Docker using the workaround below.

Warning

The workaround below does not apply to the baremetal-based clusters, including MOS deployments, since Docker restart may destroy the Ceph cluster.

Workaround:

  1. SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation.

  2. In /etc/docker/daemon.json, add the following parameters:

    {
        ...
        "default-ulimits": {
            "core": {
                "Hard": 0,
                "Name": "core",
                "Soft": 0
            }
        }
    }
    
  3. Restart the Docker daemon:

    systemctl restart docker
    
  4. Repeat the steps above on each machine of the affected cluster one by one.


[13845] Cluster update fails during the LCM agent upgrade with x509 error

Fixed in 2.11.0

During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0, the LCM agent upgrade fails with the following error in logs:

lcmAgentUpgradeStatus:
    error: 'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent:
      x509: certificate signed by unknown authority'

Only clusters initially deployed using Container Cloud 2.4.0 or earlier are affected.

As a workaround, restart lcm-agent using the service lcm-agent-* restart command on the affected nodes.


[6066] Helm releases get stuck in FAILED or UNKNOWN state

During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.

HelmBundle cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

Workaround:

  1. Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath='{.status.releaseStatuses.stacklight}'
    
    In the command above and in the steps below, replace the parameters
    enclosed in angle brackets with the corresponding values of your cluster.
    

    Example of system response:

    stacklight:
    attempt: 2
    chart: ""
    finishedAt: "2021-02-05T09:41:05Z"
    hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
    message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
      updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
      \"helmbundles.lcm.mirantis.com\" already exists"}]'
    notes: ""
    status: UNKNOWN
    success: false
    version: 0.1.2-mcp-398
    
  2. Log in to the helm-controller pod console:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 -c tiller -- sh
    
  3. Remove the failed release. For example:

    ./helm --host=localhost:44134 delete stacklight
    

    If the version of the failed Helm release is v3:

    1. Download the Helm v3 binary. For details, see official Helm documentation.

    2. Remove the failed release:

      helm delete <failed-release-name>
      

    Once the failed release is removed, its redeployment is triggered.
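
The verification in step 1 can be narrowed down to the status field alone. The following is a sketch that assumes the status field sits directly under the release entry, as in the example response above; is_stuck_status is a hypothetical helper name.

```shell
# Hypothetical helper: return 0 if the given release status requires
# the manual recovery described in this procedure.
is_stuck_status() {
    case "$1" in
        FAILED|UNKNOWN) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (not executed here; names in angle brackets are placeholders):
#   status=$(kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> \
#     -n <clusterProjectName> -o jsonpath='{.status.releaseStatuses.stacklight.status}')
#   is_stuck_status "$status" && echo "stacklight requires manual recovery"
```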


[14125] Inaccurate nodes readiness status on a managed cluster

Fixed in 2.10.0

A managed cluster deployed or updated on a regional cluster of another provider type may display inaccurate Nodes readiness live status in the Container Cloud web UI. While all nodes are ready, the Nodes status indicates that some nodes are still not ready.

The issue occurs due to the cordon-drain desynchronization between the LCMClusterState objects and the actual state of the cluster.

Note

The workaround below must be applied only by users with the writer or cluster-admin access role assigned by the Infrastructure Operator.

To verify that the cluster is affected:

  1. Export the regional cluster kubeconfig created during the regional cluster deployment:

    export KUBECONFIG=<PathToRegionalClusterKubeconfig>
    
  2. Verify that all Kubernetes nodes of the affected managed cluster are in the ready state:

    kubectl --kubeconfig <managedClusterKubeconfigPath> get nodes
    
  3. Verify that all Swarm nodes of the managed cluster are in the ready state:

    ssh -i <sshPrivateKey> root@<controlPlaneNodeIP>
    
    docker node ls
    

    Replace the parameters enclosed in angle brackets with the SSH key that was used for the managed cluster deployment and the private IP address of any control plane node of the cluster.

    If the status of the Kubernetes and Swarm nodes is ready, proceed with the next steps. Otherwise, assess the cluster logs to identify the issue with not ready nodes.

  4. Obtain the LCMClusterState items related to the swarm-drain and cordon-drain types:

    kubectl get lcmclusterstates -n <managedClusterProjectName>
    

    The command above outputs the list of all LCMClusterState items. Verify only the LCMClusterState items whose names start with the swarm-drain- or cordon-drain- prefix.

  5. Verify the status of each LCMClusterState item of the swarm-drain and cordon-drain type:

    kubectl -n <clusterProjectName> get lcmclusterstates <lcmclusterstatesItemNameOfSwarmDrainOrCordonDrainType> -o=yaml
    

    Example of system response extract for the LCMClusterState items of the cordon-drain type:

    spec:
      arg: kaas-node-4c026e7a-8acd-48b2-bf5c-cdeaf99d812f
      clusterName: test-child-namespace
      type: cordon-drain
      value: "false"
    status:
      attempt: 0
      value: "false"
    

    Example of system response extract for the LCMClusterState items of the swarm-drain type:

    spec:
      arg: kaas-node-4c026e7a-8acd-48b2-bf5c-cdeaf99d812f
      clusterName: test-child-namespace
      type: swarm-drain
      value: "true"
    status:
      attempt: 334
      message: 'Error: waiting for kubernetes node kaas-node-4c026e7a-8acd-48b2-bf5c-cdeaf99d812f
        to be drained first'
    

    The cluster is affected if:

    • For cordon-drain, spec.value and status.value are "false"

    • For swarm-drain, spec.value is "true" and the status.message contains an error related to waiting for the Kubernetes cordon-drain to finish

Workaround:

For each LCMClusterState item of the swarm-drain type with spec.value == "true" and the status.message described above, replace "true" with "false" in spec.value:

kubectl -n <clusterProjectName> edit lcmclusterstate <lcmclusterstatesItemNameOfSwarmDrainType>
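
As a non-interactive alternative to editing each item, the relevant items can be filtered by name and patched. The following is a sketch only: filter_drain_states is a hypothetical helper name, and using kubectl patch with a merge patch instead of kubectl edit is an assumption that is not part of the documented procedure.

```shell
# Hypothetical helper: read item names from stdin and keep only those that
# start with the swarm-drain- or cordon-drain- prefix.
filter_drain_states() {
    grep -E '^(swarm-drain-|cordon-drain-)'
}

# Usage (not executed here; names in angle brackets are placeholders):
#   kubectl -n <clusterProjectName> get lcmclusterstates --no-headers \
#     -o custom-columns=NAME:.metadata.name | filter_drain_states
#
# Assumed equivalent of the manual edit for one affected swarm-drain item:
#   kubectl -n <clusterProjectName> patch lcmclusterstate <itemName> \
#     --type merge -p '{"spec":{"value":"false"}}'
```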


Upgrade

[13292] Local volume provisioner pod stuck in Terminating status after upgrade

After upgrade of Container Cloud from 2.6.0 to 2.7.0, the local volume provisioner pod in the default project is stuck in the Terminating status, even after upgrade to 2.8.0.

This issue does not affect functioning of the management, regional, or managed clusters. The issue does not prevent the successful upgrade of the cluster.

Workaround:

  1. Verify that the cluster is affected:

    kubectl get pods -n default | grep local-volume-provisioner
    

    If the output contains a pod with the Terminating status, the cluster is affected.

    Capture the affected pod name, if any.

  2. Delete the affected pod:

    kubectl -n default delete pod <LVPPodName> --force
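
The name of the affected pod can be captured from the pod list instead of being copied manually. The following is a sketch that assumes the default kubectl get pods column layout, where STATUS is the third column; terminating_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read "kubectl get pods" output from stdin and print
# the names of the pods whose STATUS column is Terminating.
terminating_pods() {
    awk '$3 == "Terminating" { print $1 }'
}

# Usage (not executed here):
#   kubectl get pods -n default | grep local-volume-provisioner | terminating_pods
```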
    

[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update

Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.

Workaround:

  1. Log in to the helm-controller pod console:

    kubectl exec -n kube-system -it helm-controller-0 -c tiller -- sh
    
  2. Identify the release that got stuck in the PENDING_UPGRADE status. For example:

    ./helm --host=localhost:44134 history openstack-operator
    

    Example of system response:

    REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
    1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
    2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
    3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
    
  3. Roll back the failed release to the previous revision. For example:

    ./helm --host=localhost:44134 rollback openstack-operator 2
    

    If the version of the failed Helm release is v3:

    1. Download the Helm v3 binary. For details, see official Helm documentation.

    2. Roll back the failed release:

      helm rollback <failed-release-name>
      

    Once done, the release will be reconciled.
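
The revision to roll back to can be derived from the history output in step 2. The following is a sketch that assumes the table layout shown in the example response and treats the revision listed immediately before the PENDING_UPGRADE entry as the target; rollback_target is a hypothetical helper name.

```shell
# Hypothetical helper: read "helm history" output from stdin and print the
# revision number listed immediately before the first PENDING_UPGRADE entry.
rollback_target() {
    awk '
        /PENDING_UPGRADE/ { if (prev != "") print prev; exit }
        $1 ~ /^[0-9]+$/   { prev = $1 }
    '
}

# Usage inside the helm-controller pod console (not executed here):
#   ./helm --host=localhost:44134 history openstack-operator | rollback_target
```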


[14152] Managed cluster upgrade fails due to DNS issues

Fixed in 2.10.0

A managed cluster release upgrade may fail due to DNS issues on pods with host networking. If this is the case, the DNS names of the Kubernetes services on the affected pod cannot be resolved.

Workaround:

  1. Export kubeconfig of the affected managed cluster. For example:

    export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
    
  2. Identify any existing pod with host networking. For example, tf-config-xxxxxx:

    kubectl get pods -n tf -l app=tf-config
    
  3. Verify the DNS names resolution of the Kubernetes services from this pod. For example:

    kubectl -n tf exec -it tf-config-vl4mh -c svc-monitor -- curl -k https://kubernetes.default.svc
    

    The system output must not contain DNS errors.

  4. If the DNS name cannot be resolved, restart all calico-node pods:

    kubectl delete pods -l k8s-app=calico-node -n kube-system
    


Container Cloud web UI

[249] A newly created project does not display in the Container Cloud web UI

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token lacks the role required for the new project. As a workaround, log out of the Container Cloud web UI and log in again.