Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.14.0 including the Cluster releases 7.4.0, 6.20.0, and 5.21.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines the known issues from previous Container Cloud releases that are still valid.


Bare metal

[20745] Namespace deletion failure after managed cluster removal

Fixed in 2.16.0

After removal of a managed cluster, the namespace is not deleted due to KaaSCephOperationRequest CRs blocking the deletion. The workaround is to manually remove finalizers and delete the KaaSCephOperationRequest CRs.

Workaround:

  1. Remove finalizers from all KaaSCephOperationRequest resources:

    kubectl -n <managed-ns> get kaascephoperationrequest -o name | xargs -I % kubectl -n <managed-ns> patch % -p '{"metadata":{"finalizers":[]}}' --type=merge
    
  2. Delete all KaaSCephOperationRequest resources:

    kubectl -n <managed-ns> delete kaascephoperationrequest --all
    

[19786] Managed cluster deployment fails due to the dnsmasq-dhcpd logs overflow

Fixed in 2.15.0

On long-running management clusters, a managed cluster deployment fails with BareMetalHost stuck in the Preparing state and the ironic-conductor and ironic-api pods reporting the not enough disk space error caused by the dnsmasq-dhcpd logs overflow.

Workaround:

  1. Log in to the ironic-conductor pod.

  2. Verify the free space in /volume/log/dnsmasq.

    • If the free space on a volume is less than 10%:

      1. Manually delete log files in /volume/log/dnsmasq/.

      2. Scale down the dnsmasq pod to 0 replicas:

        kubectl -n kaas scale deployment dnsmasq --replicas=0
        
      3. Scale up the dnsmasq pod to 1 replica:

        kubectl -n kaas scale deployment dnsmasq --replicas=1
        
    • If the volume has enough space, assess the Ironic logs to identify the root cause of the issue.
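The free-space check in step 2 can be sketched as follows. The sample df output below is illustrative and stands in for what df would report inside the ironic-conductor pod; a Use% above 90 means less than 10% of free space.

```shell
# Illustrative free-space check for /volume/log/dnsmasq. On the live pod,
# obtain the real data with: df /volume/log/dnsmasq
df_sample='Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdb1       10255636 9437184    818452  92% /volume/log/dnsmasq'

# Extract the Use% column from the data row and strip the "%" sign
used_pct=$(printf '%s\n' "$df_sample" | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$used_pct" -gt 90 ]; then
  echo "less than 10% free: clean up /volume/log/dnsmasq and restart dnsmasq"
else
  echo "enough free space: assess the Ironic logs instead"
fi
```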


[17792] Full preflight fails with a timeout waiting for BareMetalHost

If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:

preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition

Workaround:

  1. Unset full preflight by running unset KAAS_BM_FULL_PREFLIGHT.

  2. Rerun bootstrap.sh preflight, which now executes the fast preflight instead.
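The two workaround steps can be sketched as follows; the bootstrap.sh invocation is commented out because it must be run from the kaas-bootstrap directory.

```shell
# Sketch of the workaround. KAAS_BM_FULL_PREFLIGHT=true is the setting that
# triggers the issue.
export KAAS_BM_FULL_PREFLIGHT=true

# Step 1: unset full preflight
unset KAAS_BM_FULL_PREFLIGHT

# Step 2: rerun the preflight check, which now executes the fast preflight
# ./bootstrap.sh preflight

# With the variable unset, bootstrap.sh falls back to the fast preflight
[ -z "${KAAS_BM_FULL_PREFLIGHT:-}" ] && echo "fast preflight will be used"
```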


vSphere

[19737] The vSphere VM template build hangs with an empty kickstart file

Fixed in 2.15.0

On vSphere deployments with a RHEL 8.4 seed node, the VM template build hangs because an empty kickstart file is provided to the VM. In this case, the VMware web console displays the following error for the affected VM:

Kickstart file /run/install/ks.cfg is missing

The fix for the issue is implemented in the latest version of the Packer image for the VM template build.

Workaround:

  1. Open bootstrap.sh in the kaas-bootstrap folder for editing.

  2. Update the Docker image tag for the VSPHERE_PACKER_DOCKER_IMAGE variable to v1.0-39.

  3. Save edits and restart the VM template build:

    ./bootstrap.sh vsphere_template
    

[19468] ‘Failed to remove finalizer from machine’ error during cluster deletion

Fixed in 2.15.0

If a RHEL license is removed before the related managed cluster is deleted, the cluster deletion hangs with the following Machine object error:

Failed to remove finalizer from machine ...
failed to get RHELLicense object

As a workaround, recreate the removed RHEL license object with the same name using the Container Cloud web UI or API.


[14080] Node leaves the cluster after IP address change

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.

The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP address assignment to the vSphere virtual machines is performed by a DHCP server present in the vSphere network.

By default, a DHCP server keeps the lease of an IP address for 30 minutes. Usually, the dhclient on a VM prolongs the lease by sending frequent DHCP requests to the server before the lease period ends. The prolongation request period is always shorter than the default lease time on the DHCP server, so prolongation normally works. However, in case of network issues, for example, when the dhclient cannot reach the DHCP server, or when the VM takes longer than the lease time to power on, the VM may lose its assigned IP address. As a result, it obtains a new IP address.

Container Cloud does not support network reconfiguration after the IP address of a VM has changed. Therefore, this issue may cause the VM to leave the cluster.

Symptoms:

  • One of the nodes is in the NodeNotReady or down state:

    kubectl get nodes -o wide
    docker node ls
    
  • The UCP Swarm manager logs on the healthy manager node contain the following example error:

    docker logs -f ucp-swarm-manager
    
    level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>:12376"
    
  • If the affected node is manager:

    • The output of the docker info command contains the following example error:

      Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
      It's possible that too few managers are online. \
      Make sure more than half of the managers are online.
      
    • The UCP controller logs contain the following example error:

      docker logs -f ucp-controller
      
      "warning","msg":"Node State Active check error: \
      Swarm Mode Manager health check error: \
      info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
      Is the docker daemon running?
      
  • On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.

  • The following lines are present in /var/log/messages:

    dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
    

    If there are several lines where the IP is different, the node is affected.
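The /var/log/messages check above can be scripted; this is a sketch in which the sample lines stand in for the real log file, and the node counts as affected when dhclient has bound to more than one distinct IP address.

```shell
# Sample dhclient lines standing in for /var/log/messages on a real node
cat > /tmp/messages.sample <<'EOF'
Jan 10 10:00:01 node1 dhclient[881]: bound to 172.16.10.15 -- renewal in 1530 seconds
Jan 10 10:25:31 node1 dhclient[881]: bound to 172.16.10.15 -- renewal in 1530 seconds
Jan 10 11:02:12 node1 dhclient[881]: bound to 172.16.10.27 -- renewal in 1530 seconds
EOF

# Count the distinct IP addresses that dhclient has bound to
distinct_ips=$(grep -oE 'bound to [0-9.]+' /tmp/messages.sample | awk '{print $3}' | sort -u | wc -l)

if [ "$distinct_ips" -gt 1 ]; then
  echo "node affected: dhclient bound to $distinct_ips distinct IP addresses"
fi
```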

Workaround:

Select from the following options:

  • Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.

  • Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address.

  • If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.


LCM

[6066] Helm releases get stuck in FAILED or UNKNOWN state

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machines statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.

HelmBundle cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

Workaround:

  1. Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
    
    In the command above and in the steps below, replace the parameters
    enclosed in angle brackets with the corresponding values of your cluster.
    

    Example of system response:

    stacklight:
      attempt: 2
      chart: ""
      finishedAt: "2021-02-05T09:41:05Z"
      hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
      message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
        updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
        \"helmbundles.lcm.mirantis.com\" already exists"}]'
      notes: ""
      status: UNKNOWN
      success: false
      version: 0.1.2-mcp-398
    
  2. Log in to the helm-controller pod console:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 -c tiller -- sh
    
  3. Download the Helm v3 binary. For details, see official Helm documentation.

  4. Remove the failed release:

    helm delete <failed-release-name>
    

    For example:

    helm delete stacklight
    

    Once done, the release is triggered for redeployment.



IAM

[21024] Adding a custom certificate for Keycloak hangs with a timeout warning

Fixed in 2.15.0

Adding a custom certificate for Keycloak using the container-cloud binary hangs with the failed to wait for OIDC certificate to be updated timeout warning. The readiness check fails due to a wrong condition.

Ignore the timeout warning. If you can log in to the Container Cloud web UI, the certificate has been applied successfully.


[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page

Fixed in 2.18.0

During configuration of a SAML identity provider using the Add identity provider menu of the Keycloak admin console, the page style breaks, and the Save and Cancel buttons disappear.

Workaround:

  1. Log in to the Keycloak admin console.

  2. In the sidebar menu, switch to the Master realm.

  3. Navigate to Realm Settings > Themes.

  4. In the Admin Console Theme drop-down menu, select keycloak.

  5. Click Save and refresh the browser window to apply the changes.


StackLight

[18933] Alerta pods fail to pass the readiness check

Fixed in 2.15.0

Occasionally, an Alerta pod may not be Ready even if Patroni, the Alerta back end, operates correctly. In this case, some of the following errors may appear in the Alerta logs:

2021-10-25 13:10:55,865 DEBG 'nginx' stdout output:
2021/10/25 13:10:55 [crit] 25#25: *17408 connect() to unix:/tmp/uwsgi.sock failed (2: No such file or directory) while connecting to upstream, client: 127.0.0.1, server: , request: "GET /api/config HTTP/1.1", upstream: "uwsgi://unix:/tmp/uwsgi.sock:", host: "127.0.0.1:8080"
ip=\- [\25/Oct/2021:13:10:55 +0000] "\GET /api/config HTTP/1.1" \502 \157 "\-" "\python-requests/2.24.0"
/web | /api/config | > GET /api/config HTTP/1.1
2021-11-11 00:02:23,969 DEBG 'nginx' stdout output:
2021/11/11 00:02:23 [error] 23#23: *2014 connect() to unix:/tmp/uwsgi.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 172.16.37.243, server: , request: "GET /api/services HTTP/1.1", upstream: "uwsgi://unix:/tmp/uwsgi.sock:", host: "10.233.113.143:8080"
ip=\- [\11/Nov/2021:00:02:23 +0000] "\GET /api/services HTTP/1.1" \502 \157 "\-" "\kube-probe/1.20+"
/web | /api/services | > GET /api/services HTTP/1.1

As a workaround, manually restart the affected Alerta pods:

kubectl delete pod -n stacklight <POD_NAME>
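To identify affected pods, the log errors above can be matched with a simple grep; in this sketch, the sample line stands in for the real output of kubectl -n stacklight logs <POD_NAME>.

```shell
# Sample nginx error line standing in for the real Alerta pod logs
cat > /tmp/alerta.log <<'EOF'
2021/11/11 00:02:23 [error] 23#23: *2014 connect() to unix:/tmp/uwsgi.sock failed (11: Resource temporarily unavailable) while connecting to upstream
EOF

# An affected pod logs failures to connect to the uwsgi socket
if grep -q 'uwsgi.sock failed' /tmp/alerta.log; then
  echo "pod affected: restart it with kubectl delete pod"
fi
```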

[19682] URLs in Salesforce alerts use HTTP for IAM with enabled TLS

Fixed in 2.15.0

Prometheus web UI URLs in StackLight notifications sent to Salesforce use a wrong protocol: HTTP instead of HTTPS. The issue affects deployments with TLS enabled for IAM.

The workaround is to manually change the URL protocol in the web browser.


Storage

[20312] Creation of ceph-based PVs gets stuck in Pending state

The csi-provisioner container of the csi-rbdplugin-provisioner pod may constantly retry the PV creation if the pod was scheduled and started on a node that has no connectivity to the Ceph storage. As a result, creation of a Ceph-based persistent volume (PV) may get stuck in the Pending state.

As a workaround, manually specify the affinity or toleration rules for the csi-rbdplugin-provisioner pod.

Workaround:

  1. On the managed cluster, open the rook-ceph-operator-config ConfigMap for editing:

    kubectl edit configmap -n rook-ceph rook-ceph-operator-config
    
  2. To avoid spawning pods on the nodes where this is not needed, set the provisioner node affinity specifying the required node labels. For example:

    CSI_PROVISIONER_NODE_AFFINITY: "role=storage-node; storage=rook, ceph"
    

Note

If needed, you can also specify CSI_PROVISIONER_TOLERATIONS tolerations. For example:

CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists

[20355] KaaSCephOperationRequest is cached after recreation with the same name

Fixed in 2.15.0

When you create a new KaaSCephOperationRequest CR with the same metadata.name as a previously created and manually deleted KaaSCephOperationRequest CR, the new request inherits information about the previous actions and is in the Completed phase. In this case, no removal is performed.

Workaround:

  1. On the management cluster, manually delete the old KaaSCephOperationRequest CR with the same metadata.name:

    kubectl -n ceph-lcm-mirantis delete kaascephoperationrequest <name>
    
  2. On the managed cluster, manually delete the old CephOsdRemoveRequest with the same metadata.name:

    kubectl -n ceph-lcm-mirantis delete cephosdremoverequest <name>
    

[20298] Spec validation failing during KaaSCephOperationRequest creation

Fixed in 2.15.0

Spec validation may fail with the following error when creating a KaaSCephOperationRequest CR:

The KaaSCephOperationRequest "test-remove-osd" is invalid: spec: Invalid value: 1:
spec in body should have at most 1 properties

Workaround:

  1. On the management cluster, open the kaascephoperationrequests.kaas.mirantis.com CRD for editing:

    kubectl edit crd kaascephoperationrequests.kaas.mirantis.com
    
  2. Remove maxProperties: 1 and minProperties: 1 from spec.versions[0].schema.openAPIV3Schema.properties.spec:

    spec:
      maxProperties: 1
      minProperties: 1
    

[19645] Ceph OSD removal request failure during ‘Processing’

Fixed in 2.15.0

Occasionally, during the Processing phase of a Ceph OSD removal request, KaaSCephOperationRequest retries the osd stop command without an interval, which leads to the removal request failure.

As a workaround, create a new request to proceed with the Ceph OSD removal.

[19574] Ceph OSD removal does not clean up device used for multiple OSDs

Fixed in 2.15.0

When executing a Ceph OSD removal request to remove Ceph OSDs placed on one disk, the request completes without errors but the device itself still keeps the old LVM partitions. As a result, Rook cannot use such a device.

The workaround is to manually clean up the affected device as described in Rook documentation: Zapping Devices.


Upgrade

[20459] Cluster upgrade fails with the certificate error during Ansible update

Fixed in 2.15.0

An upgrade of a management or regional cluster originally deployed using the Container Cloud release earlier than 2.8.0 fails with error setting certificate verify locations during Ansible update if a machine contains /usr/local/share/ca-certificates/mcc.crt, which is either empty or invalid. Managed clusters are not affected.

Workaround:

On every machine of the affected management or regional cluster:

  1. Delete /usr/local/share/ca-certificates/mcc.crt.

  2. In /etc/lcm/environment, remove the following line:

    export SSL_CERT_FILE="/usr/local/share/ca-certificates/mcc.crt"
    
  3. Restart lcm-agent:

    systemctl restart lcm-agent-v0.3.0-104-gb7f5e8d8
    

[20455] Cluster upgrade fails on the LCMMachine CRD update

An upgrade of a management or regional cluster originally deployed using the Container Cloud release earlier than 2.8.0 fails with:

  • The LCM Agent version not updating from v0.3.0-67-g25ab9f1a to v0.3.0-105-g6fb89599

  • The following error message appearing in the events of the related LCMMachine:

    kubectl describe lcmmachine <machineName>
    
    Failed to upgrade agent: failed to update agent upgrade status: \
    LCMMachine.lcm.mirantis.com "master-0" is invalid: \
    status.lcmAgentUpgradeStatus.finishedAt: Invalid value: "null": \
    status.lcmAgentUpgradeStatus.finishedAt in body must be of type string: "null"
    

As a workaround, change the preserveUnknownFields value for the LCMMachine CRD to false:

kubectl patch crd lcmmachines.lcm.mirantis.com -p '{"spec":{"preserveUnknownFields":false}}'

[4288] Equinix and MOS managed clusters update failure

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

Equinix Metal and MOS-based managed clusters may fail to update to the latest Cluster release with kubelet being stuck and reporting authorization errors.

The cluster is affected by the issue if you see the Failed to make webhook authorizer request: context canceled error in the kubelet logs:

docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'

As a workaround, restart the ucp-kubelet container on the affected node(s):

ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet

Note

Ignore failures in the output of the first command, if any.


[16379,23865] Cluster update fails with the FailedMount warning

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

  1. Verify that the description of the pods that failed to run contain the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    
    • <affectedProjectName> is the Container Cloud project name where the pods failed to run

    • <affectedPodName> is a pod name that failed to run in this project

    In the pod description, identify the node name where the pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.

  4. On every csi-rbdplugin pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.



Container Cloud web UI

[249] A newly created project does not display in the Container Cloud web UI

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token lacks the necessary role for the new project. As a workaround, log out and log in again to the Container Cloud web UI.