Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.6.0, including the Cluster releases 5.13.0 and 6.12.0.

Note

This section also outlines still valid known issues from previous Container Cloud releases.


AWS

[8013] Managed cluster deployment requiring PVs may fail

Fixed in the Cluster release 7.0.0

Note

The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

On a management cluster with multiple AWS-based managed clusters, some clusters fail to complete the deployments that require persistent volumes (PVs), for example, Elasticsearch. Some of the affected pods get stuck in the Pending state with the "pod has unbound immediate PersistentVolumeClaims" and "node(s) had volume node affinity conflict" errors.

Warning

The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.

Workaround:

  1. Obtain the persistent volume claims related to the storage mounts of the affected pods:

    kubectl get pod/<pod_name1> pod/<pod_name2> \
    -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
    

    Note

    In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.

  2. Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:

    kubectl -n stacklight delete \
      pod/<pod_name1> pod/<pod_name2> ... \
      pvc/<pvc_name1> pvc/<pvc_name2> ...
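
To find which pods to pass to the commands above, the Pending pods can be filtered out of the regular pod listing. The following is a minimal sketch, not part of the official procedure: the pending_pods helper name is hypothetical, and it assumes the default kubectl get pods column layout, where STATUS is the third column.

```shell
# Print the names of pods that are stuck in the Pending state.
# Reads `kubectl get pods --no-headers` output on stdin.
# Assumption: columns are NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk '$3 == "Pending" { print $1 }'
}

# Usage against a live cluster (stacklight is only an example namespace):
#   kubectl -n stacklight get pods --no-headers | pending_pods
```

The resulting names can then be used as <pod_name1>, <pod_name2>, and so on in the steps above.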
    


vSphere

[12683] The kaas-ipam pods restart on the vSphere region with IPAM disabled

Fixed in Container Cloud 2.7.0

Even though IPAM is disabled on the vSphere-based regional cluster deployed on top of an AWS-based management cluster, the regional cluster still has the kaas-ipam pods installed and continuously restarts them. In this case, the pod logs contain errors similar to the following:

Waiting for CRDs. [baremetalhosts.metal3.io clusters.cluster.k8s.io machines.cluster.k8s.io
ipamhosts.ipam.mirantis.com ipaddrs.ipam.mirantis.com subnets.ipam.mirantis.com subnetpools.ipam.mirantis.com \
l2templates.ipam.mirantis.com] not found yet
E0318 11:58:21.067502  1 main.go:240] Fetch CRD list failed: \
Object 'Kind' is missing in 'unstructured object has no kind'

As a result, the KubePodCrashLooping StackLight alerts are firing in Alertmanager for kaas-ipam. Disregard these alerts.

[13176] ClusterNetwork settings may disappear from the cluster provider spec

Fixed in Container Cloud 2.7.0

A vSphere-based cluster with IPAM enabled may lose the cluster network settings related to IPAM, which leads to invalid metadata being provided to virtual machines. As a result, virtual machines cannot obtain the assigned IP addresses. The issue occurs during a management cluster bootstrap or a managed cluster creation.

Workaround:

  • If the management cluster with IPAM enabled is not deployed yet, follow the steps below before launching the bootstrap.sh script:

    1. Open kaas-bootstrap/releases/kaas/2.6.0.yaml for editing.

    2. Change the release-controller version from 1.18.1 to 1.18.3:

      - name: release-controller
        version: 1.18.3
        chart: kaas-release/release-controller
        namespace: kaas
        values:
          image:
            tag: 1.18.3
      

    Now, proceed with the management cluster bootstrap.

  • If the management cluster is already deployed, and you want to create a vSphere-based managed cluster with IPAM enabled:

    1. Log in to a local machine where your management or regional cluster kubeconfig is located and export it:

      export KUBECONFIG=kaas-bootstrap/kubeconfig
      
    2. Edit the kaasrelease object by updating the release-controller chart and image version from 1.18.1 to 1.18.3:

      kubectl edit kaasrelease kaas-2-6-0
      
      - chart: kaas-release/release-controller
        name: release-controller
        namespace: kaas
        values:
          image:
            tag: 1.18.3
        version: 1.18.3
      
    3. Verify that the release-controller deployment is ready with 3/3 replicas:

      kubectl get deployment release-controller-release-controller -n kaas -o=jsonpath='{.status.readyReplicas}/{.status.replicas}'
      

    Now, you can deploy managed clusters with IPAM enabled. For details, see Operations Guide: Create a vSphere-based managed cluster.
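
The readiness check in step 3 can be repeated until the deployment reports all replicas ready. The sketch below is one possible way to do that and is not taken from the product documentation: the helper names, the default of 30 attempts, and the 10-second interval are all assumptions.

```shell
# Succeed only when a "<ready>/<total>" string reports that every replica is
# ready, for example "3/3". An empty ready count (as in "/3") means not ready.
all_replicas_ready() {
  case "${1:-}" in
    */*) ;;
    *) return 1 ;;
  esac
  ready=${1%%/*}
  total=${1##*/}
  [ -n "$ready" ] && [ "$ready" != "0" ] && [ "$ready" = "$total" ]
}

# Hypothetical helper: poll the deployment up to $1 times (default 30),
# sleeping 10 seconds between attempts.
wait_for_release_controller() {
  attempts=${1:-30}
  status=""
  while [ "$attempts" -gt 0 ]; do
    status=$(kubectl get deployment release-controller-release-controller -n kaas \
      -o=jsonpath='{.status.readyReplicas}/{.status.replicas}') || status=""
    if all_replicas_ready "$status"; then
      echo "release-controller is ready: $status"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 10
  done
  echo "release-controller is not ready: ${status:-unknown}" >&2
  return 1
}
```

Run wait_for_release_controller with the same KUBECONFIG that was exported in step 1.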


Bare metal

[7655] Wrong status for an incorrectly configured L2 template

Fixed in 2.11.0

If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with the runtime errors in the IpamHost object.

Workaround:

If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
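
As an illustration of this check, the two fields can be pulled out of the YAML dump of the object. This is a sketch under assumptions: the ipamhost_results helper name and the <ipamHostName> and <projectName> placeholders are hypothetical, and the filter matches the field names wherever they appear in the object.

```shell
# Print only the l2RenderResult and ipAllocationResult lines from an
# IpamHost object dumped as YAML on stdin.
ipamhost_results() {
  grep -E 'l2RenderResult|ipAllocationResult'
}

# Usage against a live cluster (placeholders are hypothetical):
#   kubectl get ipamhost <ipamHostName> -n <projectName> -o yaml | ipamhost_results
```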



StackLight

[13078] Elasticsearch does not receive data from Fluentd

Fixed in Container Cloud 2.7.0

Elasticsearch may stop receiving new data from Fluentd. In this case, error messages similar to the following appear in the fluentd-elasticsearch logs:

ElasticsearchError error="400 - Rejected by Elasticsearch [error type]:
illegal_argument_exception [reason]: 'Validation Failed: 1: this action would
add [15] total shards, but this cluster currently has [2989]/[3000] maximum
shards open;'" location=nil tag="ucp-kubelet"

The workaround is to manually increase the limit of open index shards per node:

kubectl -n stacklight exec -ti elasticsearch-master-0 -- \
curl -XPUT -H "content-type: application/json" \
-d '{"persistent":{"cluster.max_shards_per_node": 20000}}' \
http://127.0.0.1:9200/_cluster/settings
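
To see how close the cluster is to the shard limit, the current and maximum shard counts can be extracted from the error message shown above. The following is a minimal sketch: the shard_usage helper name and the <fluentdElasticsearchPodName> placeholder are hypothetical, and the pattern assumes the message wording from the example.

```shell
# Print "<open shards> <maximum shards>" for every log line on stdin that
# contains the "currently has [N]/[M] maximum" fragment of the error message.
shard_usage() {
  sed -n 's|.*currently has \[\([0-9]*\)]/\[\([0-9]*\)] maximum.*|\1 \2|p'
}

# Usage against a live cluster (the pod name placeholder is hypothetical):
#   kubectl -n stacklight logs <fluentdElasticsearchPodName> | shard_usage
```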

Storage

[10060] Ceph OSD node removal fails

Fixed in Container Cloud 2.7.0

Ceph node removal is not triggered properly after updating the KaaSCephCluster custom resource (CR). Both management and managed clusters are affected.

Workaround:

  1. Remove the parameters for a Ceph OSD from the KaaSCephCluster CR as described in Operations Guide: Add, remove, or reconfigure Ceph nodes.

  2. Obtain the IDs of the osd and mon services that are located on the old node:

    1. Obtain the UID of the affected machine:

      kubectl get machine <CephOSDNodeName> -n <ManagedClusterProjectName> -o jsonpath='{.metadata.annotations.kaas\.mirantis\.com\/uid}'
      
    2. Export kubeconfig of your managed cluster. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
      
    3. Identify the IDs of the pods that run the osd and mon services:

      kubectl get pods -o wide -n rook-ceph | grep <affectedMachineUID> | grep -E "mon|osd"
      

      Example of the system response extract:

      rook-ceph-mon-c-7bbc5d757d-5bpws                              1/1  Running    1  6h1m
      rook-ceph-osd-2-58775d5568-5lklw                              1/1  Running    4  44h
      rook-ceph-osd-prepare-705ae6c647cfdac928c63b63e2e2e647-qn4m9  0/1  Completed  0  94s
      

      The pod names include the IDs of the osd or mon services. In the example system response above, the osd ID is 2 and the mon ID is c.

  3. Delete the deployments of the osd and mon services obtained in the previous step:

    kubectl delete deployment rook-ceph-<osd|mon>-<ID> -n rook-ceph
    

    For example:

    kubectl delete deployment rook-ceph-mon-c -n rook-ceph
    kubectl delete deployment rook-ceph-osd-2 -n rook-ceph
    
  4. Log in to the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    
  5. Rebalance the Ceph OSDs:

    ceph osd out osd.<ID>
    

    Wait for the rebalance to complete.

  6. Rebalance the Ceph data:

    ceph osd purge osd.<ID>
    

    Wait for the Ceph data to rebalance.

  7. Remove the old node from the Ceph OSD tree:

    ceph osd crush rm <NodeName>
    
  8. If the removed node contained mon services, remove them:

    ceph mon rm <monID>
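
The osd and mon IDs used throughout the steps above can be derived from the pod names returned in step 2. The sketch below assumes the rook-ceph-<service>-<ID>-... naming shown in the example system response; the ceph_service_ids helper name is hypothetical.

```shell
# Print "<service> <ID>" for each mon and osd pod listed on stdin,
# skipping the rook-ceph-osd-prepare-* job pods.
ceph_service_ids() {
  awk '{
    split($1, p, "-")
    # p[1]="rook", p[2]="ceph", p[3]=service name, p[4]=service ID
    if ((p[3] == "mon" || p[3] == "osd") && p[4] != "prepare") print p[3], p[4]
  }'
}

# Usage against a live cluster:
#   kubectl get pods -o wide -n rook-ceph | grep <affectedMachineUID> | ceph_service_ids
```

For the example system response above, the helper prints "mon c" and "osd 2".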
    

[7073] Cannot automatically remove a Ceph node

When removing a worker node, it is not possible to automatically remove a Ceph node. The workaround is to manually remove the Ceph node from the Ceph cluster as described in Operations Guide: Add, remove, or reconfigure Ceph nodes before removing the worker node from your deployment.

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot Ceph.

[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement

Fixed in 2.11.0

If you use a custom BareMetalHostProfile, after disk replacement on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state due to the Ceph OSD authorization key failing to be created properly.

Workaround:

  1. Export kubeconfig of your managed cluster. For example:

    export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
    
  2. Log in to the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    
  3. Delete the authorization key for the failed Ceph OSD:

    ceph auth del osd.<ID>
    
  4. SSH to the node on which the Ceph OSD cannot be created.

  5. Clean up the disk that will be a base for the failed Ceph OSD. For details, see official Rook documentation.

    Note

    Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.

  6. On the managed cluster, restart the Rook operator:

    kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
    

[12723] ceph_role_* labels remain after deleting a node from KaaSCephCluster

Fixed in 2.8.0

The ceph_role_mon and ceph_role_mgr labels that Ceph controller assigns to a node during a Ceph cluster creation are not automatically removed after deleting a node from KaaSCephCluster.

As a workaround, manually remove the labels using the following commands:

kubectl label node <nodeName> ceph_role_mon-
kubectl label node <nodeName> ceph_role_mgr-
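
kubectl removes a label when the label key is passed to kubectl label with a trailing hyphen. A minimal sketch that removes both labels from one node follows; the remove_ceph_role_labels helper name is hypothetical.

```shell
# Remove the ceph_role_mon and ceph_role_mgr labels from the node given as $1.
# A trailing "-" after the label key tells kubectl to delete that label.
remove_ceph_role_labels() {
  for label in ceph_role_mon ceph_role_mgr; do
    kubectl label node "$1" "${label}-"
  done
}

# Usage: remove_ceph_role_labels <nodeName>
```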

LCM

[13402] Cluster fails with error: no space left on device

Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters

If an application running on a Container Cloud management or managed cluster fails frequently, for example, PostgreSQL, it may produce an excessive amount of core dumps. This leads to the no space left on device error on the cluster nodes and, as a result, to the broken Docker Swarm and the entire cluster.

Core dumps are disabled by default on the operating system of the Container Cloud nodes. However, Docker does not inherit the operating system settings, so disable core dumps in Docker using the workaround below.

Warning

The workaround below does not apply to the baremetal-based clusters, including MOS deployments, since Docker restart may destroy the Ceph cluster.

Workaround:

  1. SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation.

  2. In /etc/docker/daemon.json, add the following parameters:

    {
        ...
        "default-ulimits": {
            "core": {
                "Hard": 0,
                "Name": "core",
                "Soft": 0
            }
        }
    }
    
  3. Restart the Docker daemon:

    systemctl restart docker
    
  4. Repeat the steps above on each machine of the affected cluster one by one.


[10029] Authentication fails with the 401 Unauthorized error

Authentication may not work on some controller nodes after a managed cluster creation. As a result, the Kubernetes API operations with the managed cluster kubeconfig fail with Response Status: 401 Unauthorized.

As a workaround, manually restart the ucp-controller and ucp-auth Docker services on the affected node.

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

[6066] Helm releases get stuck in FAILED or UNKNOWN state

During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machines statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.

HelmBundle cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

Workaround:

  1. Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath='{.status.releaseStatuses.stacklight}'
    
    In the command above and in the steps below, replace the parameters
    enclosed in angle brackets with the corresponding values of your cluster.
    

    Example of system response:

    stacklight:
    attempt: 2
    chart: ""
    finishedAt: "2021-02-05T09:41:05Z"
    hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
    message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
      updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
      \"helmbundles.lcm.mirantis.com\" already exists"}]'
    notes: ""
    status: UNKNOWN
    success: false
    version: 0.1.2-mcp-398
    
  2. Log in to the helm-controller pod console:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 -c tiller -- sh
    
  3. Remove the failed release. For example:

    ./helm --host=localhost:44134 delete stacklight
    

    If the version of the failed Helm release is v3:

    1. Download the Helm v3 binary. For details, see official Helm documentation.

    2. Remove the failed release:

      helm delete <failed-release-name>
      

    Once done, the release is triggered for redeployment.
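
The status check from step 1 can be scripted so that the recovery steps run only when they are needed. The following sketch is an assumption-laden illustration: the release_needs_recovery helper name is hypothetical, and the pattern accepts both the YAML-like form shown in the example system response and a JSON-like "status":"UNKNOWN" form.

```shell
# Succeed when the release status text on stdin reports FAILED or UNKNOWN.
release_needs_recovery() {
  grep -q -E '"?status"?:[[:space:]]*"?(FAILED|UNKNOWN)'
}

# Usage against a live cluster (placeholders as in step 1):
#   kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> \
#     -n <clusterProjectName> -o=jsonpath='{.status.releaseStatuses.stacklight}' \
#     | release_needs_recovery && echo "stacklight requires manual recovery"
```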



Management and regional clusters

[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update

Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.

Workaround:

  1. Log in to the helm-controller pod console:

    kubectl exec -n kube-system -it helm-controller-0 -c tiller -- sh
    
  2. Identify the release that is stuck in the PENDING_UPGRADE status. For example:

    ./helm --host=localhost:44134 history openstack-operator
    

    Example of system response:

    REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
    1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
    2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
    3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
    
  3. Roll back the failed release to the previous revision. For example:

    ./helm --host=localhost:44134 rollback openstack-operator 2
    

    If the version of the failed Helm release is v3:

    1. Download the Helm v3 binary. For details, see official Helm documentation.

    2. Roll back the failed release:

      helm rollback <failed-release-name>
      

    Once done, the release will be reconciled.
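
In the example history above, revision 3 is stuck and revision 2 is the rollback target. The revision to roll back to can also be picked from the history output, as in the sketch below. It assumes the history layout shown in the example system response; rollback_target is a hypothetical helper name.

```shell
# Print the revision that immediately precedes the first PENDING_UPGRADE
# revision in `helm history` output read from stdin.
rollback_target() {
  awk '
    /PENDING_UPGRADE/ { if (prev != "") print prev; exit }
    $1 ~ /^[0-9]+$/   { prev = $1 }
  '
}

# Usage inside the helm-controller pod console:
#   ./helm --host=localhost:44134 history openstack-operator | rollback_target
```

The helper prints nothing when no revision is in the PENDING_UPGRADE status or when the stuck revision is the first one.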


[10424] Regional cluster cleanup fails by timeout

An OpenStack-based regional cluster cleanup fails with the timeout error.

Workaround:

  1. Wait for the Cluster object to be deleted in the bootstrap cluster:

    kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
    

    The system output must be empty.

  2. Remove the bootstrap cluster manually:

    ./bin/kind delete cluster --name clusterapi
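
The wait in step 1 can be expressed as a polling loop that succeeds once the listing command prints nothing. This is a generic sketch rather than part of the documented procedure: the wait_until_empty helper name, the attempt count, and the POLL_INTERVAL variable (10 seconds by default) are assumptions.

```shell
# Run the given command repeatedly until it prints nothing to stdout.
# $1 is the maximum number of attempts; the rest is the command to run.
# Returns 0 once the output is empty, 1 if the attempts run out.
wait_until_empty() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if [ -z "$("$@" 2>/dev/null)" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# Usage (bash is required for the process substitution from step 1):
#   list_clusters() { kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster; }
#   wait_until_empty 60 list_clusters && ./bin/kind delete cluster --name clusterapi
```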
    


Container Cloud web UI

[249] A newly created project does not display in the Container Cloud web UI

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.