Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.12.0, including the Cluster releases 7.2.0, 6.19.0, and 5.19.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.


AWS

[8013] Managed cluster deployment requiring PVs may fail

Fixed in the Cluster release 7.0.0

Note

The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

On a management cluster with multiple AWS-based managed clusters, some clusters fail to complete the deployments that require persistent volumes (PVs), for example, Elasticsearch. Some of the affected pods get stuck in the Pending state with the 'pod has unbound immediate PersistentVolumeClaims' and 'node(s) had volume node affinity conflict' errors.
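
To identify the affected pods, you can list the pods stuck in the Pending state and inspect their events. The sketch below uses the stacklight namespace as an example; adjust the namespace to your deployment:

kubectl -n stacklight get pods --field-selector=status.phase=Pending
kubectl -n stacklight describe pod <pod_name> | grep -A 5 Events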

Warning

The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.

Workaround:

  1. Obtain the persistent volume claims related to the storage mounts of the affected pods:

    kubectl get pod/<pod_name1> pod/<pod_name2> \
    -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
    

    Note

    In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.

  2. Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:

    kubectl -n stacklight delete \
      pod/<pod_name1> pod/<pod_name2> ... \
      pvc/<pvc_name1> pvc/<pvc_name2> ...
    


Azure

[17705] Failure to deploy more than 62 Azure worker nodes

Fixed in 2.13.0

The default value (1024) of the Ports per instance outbound NAT setting of the load balancer prevents deployment of more than 62 Azure worker nodes on a managed cluster. To work around the issue, set the Ports per instance parameter to 256.

Workaround:

  1. Log in to the Azure portal.

  2. Navigate to Home > Load Balancing.

  3. Find and click the load balancer called mcc-<uniqueClusterID>. You can obtain <uniqueClusterID> in the Cluster info field in the Container Cloud web UI.

  4. In the load balancer Settings left-side menu, click Outbound rules > OutboundNATAllProtocols.

  5. In the Outbound ports > Choose by menu, select Ports per instance.

  6. In the Ports per instance field, replace the default 1024 value with 256.

  7. Click Save to apply the new setting.
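
The 62-node limit follows from SNAT port arithmetic: Azure allocates 64,000 SNAT ports per load balancer frontend IP address, so 64,000 / 1024 ≈ 62 backend instances, whereas 64,000 / 256 ≈ 250. If you prefer the Azure CLI over the portal, the following sketch applies the same change; the az network lb outbound-rule update command and the --allocated-outbound-ports parameter name are assumptions based on the Azure CLI outbound rule command group and must be verified against your CLI version, and <resourceGroupName> is a placeholder for the resource group of the load balancer:

# Assumption: the outbound-rule update subcommand and parameter name below
# may differ depending on the Azure CLI version. Verify before use.
az network lb outbound-rule update \
  --resource-group <resourceGroupName> \
  --lb-name mcc-<uniqueClusterID> \
  --name OutboundNATAllProtocols \
  --allocated-outbound-ports 256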



Bare metal

[18752] Bare metal hosts in ‘provisioned registration error’ state after update

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

After an update of a management or managed cluster created using a Container Cloud release earlier than 2.6.0, a bare metal host is in the Provisioned state in the Container Cloud web UI while its logs contain the error state with the following message:

status:
  errorCount: 1
  errorMessage: 'Host adoption failed: Error while attempting to adopt node  7a8d8aa7-e39d-48ec-98c1-ed05eacc354f:
    Validation of image href http://10.10.10.10/images/stub_image.qcow2 failed,
    reason: Got HTTP code 404 instead of 200 in response to HEAD request..'
  errorType: provisioned registration error

The issue is caused by the image URL pointing to an unavailable resource because the IP address in the URL changed during the update. As a workaround, update the URLs in the bare metal host status and spec with correct values that use a stable DNS record as the host.

Workaround:

Note

In the commands below, we update master-2 as an example. Replace it with the corresponding value to fit your deployment.

  1. Exit Lens.

  2. In a new terminal, configure access to the affected cluster.

  3. Start kubectl proxy:

    kubectl proxy &
    
  4. Pause the reconcile:

    kubectl patch bmh master-2 --type=merge --patch '{"metadata":{"annotations":{"baremetalhost.metal3.io/paused": "true"}}}'
    
  5. Create the payload data with the following content:

    • For status_payload.json:

      {
         "status": {
            "errorCount": 0,
            "errorMessage": "",
            "provisioning": {
               "image": {
                  "checksum": "http://httpd-http/images/stub_image.qcow2.md5sum",
                  "url": "http://httpd-http/images/stub_image.qcow2"
               },
               "state": "provisioned"
            }
         }
      }
      
    • For spec_payload.json:

      {
         "spec": {
            "image": {
               "checksum": "http://httpd-http/images/stub_image.qcow2.md5sum",
               "url": "http://httpd-http/images/stub_image.qcow2"
            }
         }
      }
      
  6. Verify that the payload data is valid:

    cat status_payload.json | jq
    cat spec_payload.json | jq
    

    The system response must contain the data added in the previous step.

  7. Patch the bare metal host status with payload:

    curl -k -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" --data-binary "@status_payload.json" 127.0.0.1:8001/apis/metal3.io/v1alpha1/namespaces/default/baremetalhosts/master-2/status
    
  8. Patch the bare metal host spec with payload:

    kubectl patch bmh master-2 --type=merge --patch "$(cat spec_payload.json)"
    
  9. Resume the reconcile:

    kubectl patch bmh master-2 --type=merge --patch '{"metadata":{"annotations":{"baremetalhost.metal3.io/paused":null}}}'
    
  10. Close the terminal to stop kubectl proxy and resume Lens.
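
To verify that both patches were applied, you can check the resulting URLs in the bare metal host object; a sketch using the same master-2 example:

kubectl get bmh master-2 -o jsonpath='{.spec.image.url}{"\n"}{.status.provisioning.image.url}{"\n"}'

Both values must use the stable DNS record as the host, for example, http://httpd-http/images/stub_image.qcow2.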

[17981] Failure to redeploy a bare metal node with RAID 1

Fixed in 2.13.0

Redeployment of a bare metal node with mdadm-based RAID 1 enabled fails due to insufficient cleanup of RAID devices.

Workaround:

  1. Boot the affected node from any LiveCD, preferably Ubuntu.

  2. Obtain details about the mdadm RAID devices:

    sudo mdadm --detail --scan --verbose
    
  3. Stop all mdadm RAID devices listed in the output of the above command. For example:

    sudo mdadm --stop /dev/md0
    
  4. Clean up the metadata on partitions with the mdadm RAID device(s) enabled. For example:

    sudo mdadm --zero-superblock /dev/sda1
    

    In the above example, replace /dev/sda1 with the partitions listed in the output of the command provided in step 2.
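
If the node has many RAID member partitions, the following sketch iterates over all member devices reported by mdadm. It is an illustration only and assumes that the devices= lines in the mdadm --detail --scan --verbose output list the member partitions of the arrays that you stopped in step 3:

for dev in $(sudo mdadm --detail --scan --verbose | awk -F'devices=' '/^[[:space:]]*devices=/{gsub(",", " ", $2); print $2}'); do
  # Zero the mdadm superblock on each member partition
  sudo mdadm --zero-superblock "$dev"
done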


[17960] Overflow of the Ironic storage volume

Fixed in 2.13.0

On baremetal-based management clusters with Container Cloud 2.12.0 or earlier, the storage volume used by Ironic can run out of free space. As a result, a StackLight alert is triggered for the ironic-aio-pvc volume filling up.

Symptoms

One or more of the following symptoms are observed:

  • The StackLight KubePersistentVolumeUsageCritical alert is firing for the volume ironic-aio-pvc.

  • The ironic and dnsmasq Deployments are not in the OK status:

    kubectl -n kaas get deployments
    
  • One or multiple ironic and dnsmasq pods fail to start:

    • For dnsmasq:

      kubectl get pods -n kaas -o wide | grep dnsmasq
      

      If the number of ready containers for the pod is not 2/2, the management cluster can be affected by the issue.

    • For ironic:

      kubectl get pods -n kaas -o wide | grep ironic
      

      If the number of ready containers for the pod is not 6/6, the management cluster can be affected by the issue.

  • The free space on a volume is less than 10%. To verify space usage on a volume:

    kubectl -n kaas exec -ti deployment/ironic -c ironic-api -- /bin/bash -c 'df -h |grep -i "volume\|size"'
    

    Example of system response where 14% is the used space of a volume:

    Filesystem                 Size  Used Avail Use% Mounted on
    /dev/rbd0                  4.9G  686M  4.2G  14% /volume
    

As a workaround, truncate the log files on the storage volume:

kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ironic/ironic-api.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ironic/ironic-conductor.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ironic/ansible-playbook.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ironic-inspector/ironic-inspector.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/dnsmasq/dnsmasq-dhcpd.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ambassador/access.log'
kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c 'truncate -s 0 /volume/log/ambassador/error.log'
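
The same truncation can be expressed as a loop over the log files listed above; a convenience sketch that reuses the command from this workaround:

for f in ironic/ironic-api.log ironic/ironic-conductor.log ironic/ansible-playbook.log \
         ironic-inspector/ironic-inspector.log dnsmasq/dnsmasq-dhcpd.log \
         ambassador/access.log ambassador/error.log; do
  # Truncate each log file on the Ironic storage volume
  kubectl -n kaas exec -ti deployment/dnsmasq -- /bin/bash -c "truncate -s 0 /volume/log/${f}"
done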

[17792] Full preflight fails with a timeout waiting for BareMetalHost

If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:

preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition

Workaround:

  1. Disable full preflight by unsetting the KAAS_BM_FULL_PREFLIGHT environment variable.

  2. Rerun bootstrap.sh preflight, which executes fast preflight instead.
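
For example, assuming you run the commands from the directory that contains bootstrap.sh:

unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight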


OpenStack

[10424] Regional cluster cleanup fails by timeout

An OpenStack-based regional cluster cleanup fails with a timeout error.

Workaround:

  1. Wait for the Cluster object to be deleted in the bootstrap cluster:

    kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
    

    The system output must be empty.

  2. Remove the bootstrap cluster manually:

    ./bin/kind delete cluster --name clusterapi
    


vSphere

[14080] Node leaves the cluster after IP address change

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.

The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP addresses are assigned to the vSphere virtual machines by a DHCP server present in the vSphere network.

By default, a DHCP server keeps the lease of an IP address for 30 minutes. Usually, the dhclient on a VM renews the lease by sending DHCP requests to the server before the lease period ends. Because the renewal request period is always shorter than the default lease time on the DHCP server, renewal normally succeeds. However, in case of network issues, for example, when dhclient on the VM cannot reach the DHCP server, or when the VM takes longer than the lease time to power on, the VM may lose its assigned IP address. As a result, it obtains a new IP address.

Container Cloud does not support network reconfiguration after the IP of the VM has been changed. Therefore, such an issue may lead to a VM leaving the cluster.

Symptoms:

  • One of the nodes is in the NodeNotReady or down state:

    kubectl get nodes -o wide
    docker node ls
    
  • The UCP Swarm manager logs on the healthy manager node contain the following example error:

    docker logs -f ucp-swarm-manager
    
    level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>: 12376"
    
  • If the affected node is a manager:

    • The output of the docker info command contains the following example error:

      Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
      It's possible that too few managers are online. \
      Make sure more than half of the managers are online.
      
    • The UCP controller logs contain the following example error:

      docker logs -f ucp-controller
      
      "warning","msg":"Node State Active check error: \
      Swarm Mode Manager health check error: \
      info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
      Is the docker daemon running?
      
  • On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.

  • The following lines are present in /var/log/messages:

    dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
    

    If there are several lines where the IP is different, the node is affected.
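
    To count the distinct IP addresses that dhclient has bound to, the following sketch can help; it assumes the default dhclient log format shown above:

    grep -oE "bound to [0-9.]+" /var/log/messages | sort | uniq -c

    If the output contains more than one distinct IP address, the node is affected.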

Workaround:

Select from the following options:

  • Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.

  • Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address.

  • If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.


LCM

[16146] Stuck kubelet on the Cluster release 5.x.x series

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

Occasionally, kubelet may get stuck on the Cluster release 5.x.x series with different errors in the ucp-kubelet containers, leading to node failures. The following error occurs every time the Kubernetes API server is accessed:

an error on the server ("") has prevented the request from succeeding

As a workaround, restart ucp-kubelet on the failed node:

ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet
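
After the restart, you can verify that the node recovered; a sketch where the docker command runs on the affected node and the kubectl command runs from a host with access to the cluster kubeconfig:

docker ps --filter name=ucp-kubelet --format '{{.Names}}: {{.Status}}'
kubectl get nodes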

[6066] Helm releases get stuck in FAILED or UNKNOWN state

Note

The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.

During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.

The HelmBundle Controller cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

Workaround:

  1. Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

    kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
    
    In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
    

    Example of system response:

    stacklight:
      attempt: 2
      chart: ""
      finishedAt: "2021-02-05T09:41:05Z"
      hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
      message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
        updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
        \"helmbundles.lcm.mirantis.com\" already exists"}]'
      notes: ""
      status: UNKNOWN
      success: false
      version: 0.1.2-mcp-398
    
  2. Log in to the helm-controller pod console:

    kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
    
  3. Download the Helm v3 binary. For details, see the official Helm documentation.

  4. Remove the failed release:

    helm delete <failed-release-name>
    

    For example:

    helm delete stacklight
    

    Once done, the release is triggered for redeployment.
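
To confirm that the release was redeployed successfully, you can re-check the HelmBundle status using the same command and placeholders as in step 1; the expected value of the success field is true:

kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight.success}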



IAM

[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page

Fixed in 2.18.0

During configuration of a SAML identity provider using the Add identity provider menu of the Keycloak admin console, the page style breaks and the Save and Cancel buttons disappear.

Workaround:

  1. Log in to the Keycloak admin console.

  2. In the sidebar menu, switch to the Master realm.

  3. Navigate to Realm Settings > Themes.

  4. In the Admin Console Theme drop-down menu, select keycloak.

  5. Click Save and refresh the browser window to apply the changes.


StackLight

[17771] Watchdog alert missing in Salesforce route

Fixed in 2.13.0

The Watchdog alert is not routed to Salesforce by default.

Note

After applying the workaround, you may notice the following warning message. It is expected and does not affect configuration rendering:

Warning: Merging destination map for chart 'stacklight'. Overwriting table
item 'match', with non table value: []

Workaround:

  1. Open the StackLight configuration manifest as described in StackLight configuration procedure.

  2. In alertmanagerSimpleConfig.salesForce, specify the following configuration:

    alertmanagerSimpleConfig:
      salesForce:
        route:
          match: []
          match_re:
            severity: "informational|critical"
          matchers:
          - severity=~"informational|critical"
    

[19682] URLs in Salesforce alerts use HTTP for IAM with enabled TLS

Fixed in 2.15.0

Prometheus web UI URLs in StackLight notifications sent to Salesforce use a wrong protocol: HTTP instead of HTTPS. The issue affects deployments with TLS enabled for IAM.

The workaround is to manually change the URL protocol in the web browser.


Storage

[20312] Creation of ceph-based PVs gets stuck in Pending state

The csi-rbdplugin-provisioner pod (csi-provisioner container) may show constant retries attempting to create a PV if the csi-rbdplugin-provisioner pod was scheduled and started on a node with no connectivity to the Ceph storage. As a result, creation of a Ceph-based persistent volume (PV) may get stuck in the Pending state.

As a workaround, manually specify the affinity or toleration rules for the csi-rbdplugin-provisioner pod.

Workaround:

  1. On the managed cluster, open the rook-ceph-operator-config ConfigMap for editing:

    kubectl edit configmap -n rook-ceph rook-ceph-operator-config
    
  2. To avoid spawning the provisioner pods on nodes where they are not needed, set the provisioner node affinity by specifying the required node labels. For example:

    CSI_PROVISIONER_NODE_AFFINITY: "role=storage-node; storage=rook, ceph"
    

Note

If needed, you can also specify tolerations using the CSI_PROVISIONER_TOLERATIONS parameter. For example:

CSI_PROVISIONER_TOLERATIONS: |
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists
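
After the Rook operator applies the new settings, you can verify on which nodes the provisioner pods were scheduled; a sketch that assumes the default Rook label app=csi-rbdplugin-provisioner:

kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner -o wide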

[18879] The RGW pod overrides the global CA bundle with an incorrect mount

Fixed in 2.14.0

During deployment of a Ceph cluster, the RADOS Gateway (RGW) pod overrides the global CA bundle located at /etc/pki/tls/certs with an incorrect self-signed CA bundle. The issue affects only clusters with public certificates.

Workaround:

  1. Open the KaasCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with a corresponding value.

  2. Select from the following options:

    • If you are using the GoDaddy certificates, in the cephClusterSpec.objectStorage.rgw section, replace the cacert parameters with your public CA certificate that already contains both the root CA certificate and intermediate CA certificate:

      cephClusterSpec:
        objectStorage:
          rgw:
            SSLCert:
              cacert: |
                -----BEGIN CERTIFICATE-----
                ca-certificate here
                -----END CERTIFICATE-----
              tlsCert: |
                -----BEGIN CERTIFICATE-----
                private TLS certificate here
                -----END CERTIFICATE-----
              tlsKey: |
                -----BEGIN RSA PRIVATE KEY-----
                private TLS key here
                -----END RSA PRIVATE KEY-----
      
    • If you are using the DigiCert certificates:

      1. Download the <root_CA> from DigiCert.

      2. In the cephClusterSpec.objectStorage.rgw section, replace the cacert parameters with your public intermediate CA certificate along with the root one:

        cephClusterSpec:
          objectStorage:
            rgw:
              SSLCert:
                cacert: |
                  -----BEGIN CERTIFICATE-----
                  <root CA here>
                  <intermediate CA here>
                  -----END CERTIFICATE-----
                tlsCert: |
                  -----BEGIN CERTIFICATE-----
                  private TLS certificate here
                  -----END CERTIFICATE-----
                tlsKey: |
                  -----BEGIN RSA PRIVATE KEY-----
                  private TLS key here
                  -----END RSA PRIVATE KEY-----
        

[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13

Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1

Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and may cause data corruption. Therefore, manageOsds is disabled until further notice.

As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster, perform the steps described in Remove Ceph OSD manually.



Regional cluster

[17359] Deletion of AWS-based regional cluster credential fails

Fixed in 2.13.0

During deletion of an AWS-based regional cluster, deletion of the cluster credential fails with the following error: error deleting regional credential: error waiting for credential deletion: timed out waiting for the condition.

Workaround:

  1. Change the directory to kaas-bootstrap.

  2. Scale up the aws-credentials-controller-aws-credentials-controller deployment:

    ./bin/kind get kubeconfig --name clusterapi > kubeconfig-bootstrap
    
    kubectl --kubeconfig kubeconfig-bootstrap scale deployment \
    aws-credentials-controller-aws-credentials-controller \
    --namespace kaas --replicas=1
    
  3. Wait until the affected credential is deleted:

    kubectl --kubeconfig <pathToMgmtClusterKubeconfig> \
    get awscredentials.kaas.mirantis.com -A -l kaas.mirantis.com/region=<regionName>
    

    In the above command, replace:

    • <regionName> with the name of the region where the regional cluster is located.

    • <pathToMgmtClusterKubeconfig> with the path to the corresponding management cluster kubeconfig.

    Example of a positive system response:

    No resources found
    
  4. Delete the bootstrap cluster:

    ./bin/kind delete cluster --name clusterapi
    


Upgrade

[18193] Management cluster upgrade fails with Ceph cluster being not ready

Fixed in 2.13.0

An Equinix Metal or baremetal-based management cluster upgrade may fail with the following error message:

Reconcile MiraCeph 'ceph-lcm-mirantis/rook-ceph' failed with error:
failed to ensure cephcluster: failed to ensure cephcluster rook-ceph/rook-ceph:
ceph cluster rook-ceph/rook-ceph is not ready to be updated

Your cluster is affected if:

  1. The rook-ceph/rook-ceph-operator logs contain the following errors:

    Failed to update lock: Internal error occurred:
    unable to unmarshal response in forceLegacy: json:
    cannot unmarshal number into Go value of type bool
    
    Failed to update lock: Internal error occurred:
    unable to perform request for determining if legacy behavior should be forced
    
  2. The kubectl -n rook-ceph get cephcluster command returns the cephcluster resource with the Progressing state.

As a workaround, restart the rook-ceph-operator pod:

kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
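
After the operator pod restarts, you can verify that the Ceph cluster leaves the Progressing state; a sketch that assumes the status phase field reported by Rook for the CephCluster resource:

kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.status.phase}'

The expected value is Ready.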

[4288] Equinix and MOS managed clusters update failure

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

Equinix Metal and MOS-based managed clusters may fail to update to the latest Cluster release, with kubelet getting stuck and reporting authorization errors.

The cluster is affected by the issue if you see the Failed to make webhook authorizer request: context canceled error in the kubelet logs:

docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'

As a workaround, restart the ucp-kubelet container on the affected node(s):

ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet

Note

Ignore failures in the output of the first command, if any.
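
After the restart, you can rerun the check from above on the affected node; empty output indicates that kubelet no longer reports the authorization error:

docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'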


[16379,23865] Cluster update fails with the FailedMount warning

Fixed in 2.19.0

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

  1. Verify that the descriptions of the pods that failed to run contain the FailedMount events:

    kubectl -n <affectedProjectName> describe pod <affectedPodName>
    
    • <affectedProjectName> is the Container Cloud project name where the pods failed to run

    • <affectedPodName> is a pod name that failed to run in this project

    In the pod description, identify the node name where the pod failed to run.

  2. Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.

    1. Identify csiPodName of the corresponding csi-rbdplugin:

      kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
      -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
      
    2. Output the affected csiPodName logs:

      kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
      
  3. Scale the affected StatefulSet or Deployment of the pod that fails to init down to 0 replicas, as shown in the example scale command after this procedure.

  4. On every csi-rbdplugin pod, search for stuck csi-vol:

    for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
      echo $pod
      kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
    done
    
  5. Unmap the affected csi-vol:

    rbd unmap -o force /dev/rbd<i>
    

    The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

  6. Delete volumeattachment of the affected pod:

    kubectl get volumeattachments | grep <csi-vol-uuid>
    kubectl delete volumeattachment <id>
    
  7. Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
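
For steps 3 and 7, a generic scale command looks as follows; <affectedWorkloadName> and <originalReplicas> are placeholders for the name of the affected workload and its original replica count:

kubectl -n <affectedProjectName> scale statefulset/<affectedWorkloadName> --replicas=0
kubectl -n <affectedProjectName> scale statefulset/<affectedWorkloadName> --replicas=<originalReplicas>

For a Deployment, replace statefulset/ with deployment/ in the commands above.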


[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update

Fixed in 2.14.0

Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.

Workaround:

  1. Log in to the helm-controller pod console:

    kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
    
  2. Identify the release that is stuck in the PENDING_UPGRADE status. For example:

    ./helm --host=localhost:44134 history openstack-operator
    

    Example of system response:

    REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
    1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
    2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
    3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
    
  3. Roll back the failed release to the previous revision:

    1. Download the Helm v3 binary. For details, see the official Helm documentation.

    2. Roll back the failed release:

      helm rollback <failed-release-name>
      

      For example:

      helm rollback openstack-operator 2
      

    Once done, the release will be reconciled.


[18076] StackLight update failure

Fixed in 2.13.0

On a managed cluster with logging disabled, changing NodeSelector can cause StackLight update failure with the following message in the StackLight Helm Controller logs:

Upgrade "stacklight" failed: Job.batch "stacklight-delete-logging-pvcs-*" is invalid: spec.template: Invalid value: ...

As a workaround, disable the stacklight-delete-logging-pvcs-* job.

Workaround:

  1. Open the affected Cluster object for editing:

    kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>
    
  2. Set deleteVolumes to false:

    spec:
      ...
      providerSpec:
        ...
        value:
          ...
          helmReleases:
            ...
            - name: stacklight
              values:
                ...
                logging:
                  deleteVolumes: false
                ...
    


Container Cloud web UI

[249] A newly created project does not display in the Container Cloud web UI

Affects only Container Cloud 2.18.0 and earlier

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token is missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.