Known issues¶

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.11.0 including the Cluster releases 7.1.0, 6.18.0, and 5.18.0.

Note

This section also outlines known issues from previous Container Cloud releases that are still valid.

AWS
Equinix Metal
Bare metal
OpenStack

vSphere
LCM
IAM
StackLight

Storage
Bootstrap
Upgrade
Container Cloud web UI

AWS¶

[8013] Managed cluster deployment requiring PVs may fail¶

Fixed in the Cluster release 7.0.0

Note

The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

On a management cluster with multiple AWS-based managed clusters, some clusters fail to complete the deployments that require persistent volumes (PVs), for example, Elasticsearch. Some of the affected pods get stuck in the Pending state with the "pod has unbound immediate PersistentVolumeClaims" and "node(s) had volume node affinity conflict" errors.

Warning

The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.

Workaround:

Obtain the persistent volume claims related to the storage mounts of the affected pods:
```
kubectl get pod/<pod_name1> pod/<pod_name2> \
-o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
```
Note

In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.

Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:

kubectl -n stacklight delete \
  pod/<pod_name1> pod/<pod_name2> ... \
  pvc/<pvc_name1> pvc/<pvc_name2> ...
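To find the affected pods in the first place, the Pending ones can be filtered out of a pod listing. The following is a minimal sketch, not part of the official procedure: the pending_pods helper is a hypothetical name, and it assumes the default kubectl column layout (NAME READY STATUS RESTARTS AGE).

```shell
# Print the names of pods whose STATUS column is Pending.
# Reads `kubectl get pods --no-headers` output on stdin and assumes the
# default column layout: NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk '$3 == "Pending" { print $1 }'
}

# Example (not run here):
#   kubectl -n stacklight get pods --no-headers | pending_pods
```

Pending pods may have other causes as well, so verify the events of each listed pod before deleting anything.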

Equinix Metal¶

[16718] Equinix Metal provider fails to create machines with SSH keys error¶

Fixed in 2.12.0

If an Equinix Metal based cluster is being deployed in an Equinix Metal project with no SSH keys, the Equinix Metal provider fails to create machines with the following error:

Failed to create machine "kaas-mgmt-controlplane-0"...
failed to create device: POST https://api.equinix.com/metal/v1/projects/...
<deviceID> must have at least one SSH key or explicitly send no_ssh_keys option

Workaround:

Create a new SSH key.
Log in to the Equinix Metal console.
In Project Settings, click Project SSH Keys.
Click Add New Key and add details of the newly created SSH key.
Click Add.
Restart the cluster deployment.

Bare metal¶

[17118] Failure to add a new machine to cluster¶

Fixed in 2.12.0

Adding a new machine to a baremetal-based managed cluster may fail after the baremetal-based management cluster upgrade. The issue occurs because PXE boot does not work for the new node. In this case, the "file /volume/tftpboot/ipxe.efi not found" messages appear in the dnsmasq-tftp logs.

Workaround:

Scale the Ironic deployment down to 0 replicas:

kubectl -n kaas scale deployments/ironic --replicas=0

Scale the Ironic deployment up to 1 replica:

kubectl -n kaas scale deployments/ironic --replicas=1

OpenStack¶

[16959] Proxy-based regional cluster creation fails¶

Fixed in 2.12.0

An OpenStack-based regional cluster deployed using a proxy fails with the "Not ready objects: not ready: statefulSets: kaas/mcc-cache got 0/1 replicas" error message because of an issue with the proxy secret creation.

Workaround:

Copy the mke-proxy-secret secret from the kube-system namespace to the kaas namespace:

kubectl get secret -n kube-system mke-proxy-secret -o yaml | sed '/namespace.*/d' | kubectl create -n kaas -f -

Rerun the bootstrap script:
```
./bootstrap.sh deploy_regional
```

[10424] Regional cluster cleanup fails by timeout¶

An OpenStack-based regional cluster cleanup fails with a timeout error.

Workaround:

Wait for the Cluster object to be deleted in the bootstrap cluster:
```
kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
```
The system output must be empty.

Remove the bootstrap cluster manually:

./bin/kind delete cluster --name clusterapi
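The waiting step above can be scripted as a simple polling loop. The following is a sketch only: wait_until_empty is a hypothetical helper, and the attempt count, delay, and kubeconfig path in the usage example are arbitrary assumptions.

```shell
# Re-run a command until it succeeds with empty stdout, or give up.
# Usage: wait_until_empty <attempts> <delay_seconds> <command> [args...]
wait_until_empty() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Require both a zero exit code and empty output, so that a failing
    # command is not mistaken for "nothing left".
    if out="$("$@" 2>/dev/null)" && [ -z "$out" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here), assuming the kubeconfig was saved to a file first:
#   ./bin/kind get kubeconfig --name clusterapi > /tmp/kind_kubeconfig.yaml
#   wait_until_empty 60 10 kubectl --kubeconfig /tmp/kind_kubeconfig.yaml get cluster \
#     && ./bin/kind delete cluster --name clusterapi
```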

vSphere¶

[14458] Failure to create a container for pod: cannot allocate memory¶

Fixed in 2.9.0 for new clusters

Newly created pods may fail to run and have the CrashLoopBackOff status on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware vSphere provider. The following is an example output of the kubectl describe pod <pod-name> -n <projectName> command:

State:        Waiting
Reason:       CrashLoopBackOff
Last State:   Terminated
Reason:       ContainerCannotRun
Message:      OCI runtime create failed: container_linux.go:349:
              starting container process caused "process_linux.go:297:
              applying cgroup configuration for process caused
              "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>:
              cannot allocate memory": unknown

The issue occurs due to the Kubernetes and Docker community issues.

According to the Red Hat solution, the workaround is to disable the kernel memory accounting feature by appending cgroup.memory=nokmem to the kernel command line.

Note

The workaround below applies to the existing clusters only. The issue is resolved for new Container Cloud 2.9.0 deployments since the workaround below automatically applies to the VM template built during the vSphere-based management cluster bootstrap.

Apply the following workaround on each machine of the affected cluster.

Workaround:

SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for GRUB_CMDLINE_LINUX.

Update the kernel:

yum install kernel kernel-headers kernel-tools kernel-tools-libs kexec-tools

Update the grub configuration:
```
grub2-mkconfig -o /boot/grub2/grub.cfg
```
Reboot the machine.
Wait for the machine to become available.
Wait for 5 minutes for Docker and Kubernetes services to start.
Verify that the machine is Ready:
```
docker node ls
kubectl get nodes
```
Repeat the steps above on the remaining machines of the affected cluster.
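The /etc/default/grub change can be scripted so that it is safe to repeat on every machine. The following is a sketch under stated assumptions: add_nokmem is a hypothetical helper, and it expects GRUB_CMDLINE_LINUX to be a single double-quoted value on one line.

```shell
# Append cgroup.memory=nokmem to GRUB_CMDLINE_LINUX in the given file,
# unless the option is already present. Assumes a double-quoted value.
add_nokmem() {
  grub_file="$1"
  if grep -q '^GRUB_CMDLINE_LINUX=.*cgroup\.memory=nokmem' "$grub_file"; then
    return 0
  fi
  tmp_file="$(mktemp)"
  # Insert the option just before the closing quote of the value.
  sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 cgroup.memory=nokmem"/' "$grub_file" > "$tmp_file" &&
    cat "$tmp_file" > "$grub_file"
  rm -f "$tmp_file"
}

# Example (not run here), as root:
#   add_nokmem /etc/default/grub
```

Writing the result back with cat instead of moving the temporary file keeps the ownership and permissions of the original file.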

[14080] Node leaves the cluster after IP address change¶

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.

The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP addresses are assigned to the vSphere virtual machines by a DHCP server present in the vSphere network.

By default, a DHCP server keeps the lease of an IP address for 30 minutes. Usually, the dhclient of a VM prolongs the lease by sending DHCP requests to the server before the lease period ends. The prolongation request period is always shorter than the default lease time on the DHCP server, so prolongation usually works. However, in case of network issues, for example, when dhclient on the VM cannot reach the DHCP server, or when the VM takes longer than the lease time to power on, the VM may lose its assigned IP address and obtain a new one.

Container Cloud does not support network reconfiguration after the IP address of a VM has changed. Therefore, this issue may lead to a VM leaving the cluster.

Symptoms:

One of the nodes is in the NodeNotReady or down state:
```
kubectl get nodes -o wide
docker node ls
```

The UCP Swarm manager logs on the healthy manager node contain the following example error:

docker logs -f ucp-swarm-manager

level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>: 12376"

If the affected node is manager:

The output of the docker info command contains the following example error:

Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.

The UCP controller logs contain the following example error:

docker logs -f ucp-controller

"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?

On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.
The following lines are present in /var/log/messages:
```
dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
```
If there are several lines where the IP is different, the node is affected.
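The last check can be sketched as a small filter over the log. The helper names bound_ips and ip_changed are hypothetical, and the sketch assumes the dhclient message layout shown above.

```shell
# Print the distinct IP addresses that dhclient reported as bound.
# Reads log lines (for example, /var/log/messages) on stdin.
bound_ips() {
  sed -n 's/.*dhclient\[[0-9]*]: bound to \([0-9.]*\) .*/\1/p' | sort -u
}

# Succeed when the log on stdin shows more than one distinct bound address.
ip_changed() {
  count="$(bound_ips | wc -l | tr -d ' ')"
  [ "$count" -gt 1 ]
}

# Example (not run here):
#   ip_changed < /var/log/messages && echo "node is affected"
```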

Workaround:

Select from the following options:

Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.

LCM¶

[16146] Stuck kubelet on the Cluster release 5.x.x series¶

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.

Occasionally, kubelet may get stuck on the Cluster release 5.x.x series with different errors in the ucp-kubelet containers, leading to node failures. The following error occurs on every attempt to access the Kubernetes API server:

an error on the server ("") has prevented the request from succeeding

As a workaround, restart ucp-kubelet on the failed node:

ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet

[8367] Adding of a new manager node to a managed cluster hangs on Deploy stage¶

Fixed in 2.12.0

Adding a new manager node to a managed cluster may hang due to issues with joining etcd from the new node to the existing etcd cluster. The new manager node hangs in the Deploy stage.

Symptoms:

The Ansible run tries executing the Wait for Docker UCP to be accessible step and fails with the following error message:
```
Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
```

The etcd logs on the leader etcd node contain the following example error message occurring every 1-2 minutes:

2021-06-10 03:21:53.196677 W | etcdserver: not healthy for reconfigure,
rejecting member add {ID:22bb1d4275f1c5b0 RaftAttributes:{PeerURLs:[https://<new manager IP>:12380]
IsLearner:false} Attributes:{Name: ClientURLs:[]}}

To determine the etcd leader, run on any manager node:

docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl -w table --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint status

To verify logs on the leader node:
```
docker logs ucp-kv
```

Root cause:

In case of an unlucky network partition, the leader may lose quorum and members are not able to perform the election. For more details, see Official etcd documentation: Learning, figure 5.

Workaround:

Restart etcd on the leader node:
```
docker rm -f ucp-kv
```
Wait several minutes until the etcd cluster starts and reconciles.

The deployment of the new manager node will proceed and it will join the etcd cluster. After that, other MKE components will be configured and the node deployment will be finished successfully.

[6066] Helm releases get stuck in FAILED or UNKNOWN state¶

Note

The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.

During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.

HelmBundle cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.

Workaround:

Verify the failed release has the UNKNOWN or FAILED status in the HelmBundle object:

kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}

In the command above and in the steps below, replace the parameters
enclosed in angle brackets with the corresponding values of your cluster.

Example of system response:

stacklight:
attempt: 2
chart: ""
finishedAt: "2021-02-05T09:41:05Z"
hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error
  updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io
  \"helmbundles.lcm.mirantis.com\" already exists"}]'
notes: ""
status: UNKNOWN
success: false
version: 0.1.2-mcp-398

Log in to the helm-controller pod console:

kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller

Download the Helm v3 binary. For details, see official Helm documentation.
Remove the failed release:
```
helm delete <failed-release-name>
```
For example:
```
helm delete stacklight
```
Once done, the release is triggered for redeployment.
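The status check at the start of this workaround can be reduced to a yes/no decision. The following is a sketch: needs_manual_recovery is a hypothetical helper, and the .status suffix in the usage example is an assumption based on the status field in the example response.

```shell
# Succeed when a Helm release status requires the manual recovery above.
needs_manual_recovery() {
  case "$1" in
    FAILED|UNKNOWN) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here); the .status suffix is an assumption based on the
# "status" field in the example system response:
#   status="$(kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> \
#     -n <clusterProjectName> -o jsonpath='{.status.releaseStatuses.stacklight.status}')"
#   needs_manual_recovery "$status" && echo "stacklight requires manual recovery"
```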

IAM¶

[13385] MariaDB pods fail to start after SST sync¶

Fixed in 2.12.0

The MariaDB pods fail to start after MariaDB blocks itself during the State Snapshot Transfers sync.

Workaround:

Verify the failed pod readiness:
```
kubectl describe pod -n kaas <failedMariadbPodName>
```
If the readiness probe failed with the WSREP not synced message, proceed to the next step. Otherwise, assess the MariaDB pod logs to identify the failure root cause.

Obtain the MariaDB admin password:

kubectl get secret -n kaas mariadb-dbadmin-password -o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d ; echo

Verify that wsrep_local_state_comment is Donor or Desynced:

kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -uroot -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"

Restart the failed pod:

kubectl delete pod -n kaas <failedMariadbPodName>
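The state check before the restart can be sketched as a filter over the mysql client output. The helper names are hypothetical. Galera commonly reports the combined value Donor/Desynced, so the sketch matches on the prefix. It assumes that the value is the second field of the wsrep_local_state_comment line, in either the tab-separated or the boxed table layout.

```shell
# Extract the wsrep_local_state_comment value from mysql client output on stdin.
# Handles both the tab-separated batch layout and the boxed table layout.
# Only the first word of the value is printed.
wsrep_state() {
  awk '/wsrep_local_state_comment/ { gsub(/\|/, " "); print $2; exit }'
}

# Succeed when the state starts with Donor or Desynced.
is_donor_or_desynced() {
  case "$(wsrep_state)" in
    Donor*|Desynced*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here): pipe the output of the SHOW status command above
# into is_donor_or_desynced and restart the pod only if it succeeds.
```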

[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page¶

Fixed in 2.18.0

During configuration of a SAML identity provider using the Add identity provider menu of the Keycloak admin console, the page style breaks and the Save and Cancel buttons disappear.

Workaround:

Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select keycloak.
Click Save and refresh the browser window to apply the changes.

StackLight¶

[16843] Inability to override default route matchers for Salesforce notifier¶

Fixed in 2.12.0

It may be impossible to override the default route matchers for Salesforce notifier.

Note

After applying the workaround, you may notice the following warning message. It is expected and does not affect configuration rendering:

Warning: Merging destination map for chart 'stacklight'. Overwriting table
item 'match', with non table value: []

Workaround:

Open the StackLight configuration manifest as described in StackLight configuration procedure.

In alertmanagerSimpleConfig.salesForce, specify the following configuration:

alertmanagerSimpleConfig:
  salesForce:
    route:
      match: []
      match_re:
        your_matcher_key1: your_matcher_value1
        your_matcher_key2: your_matcher_value2
        ...

[17771] Watchdog alert missing in Salesforce route¶

Fixed in 2.13.0

The Watchdog alert is not routed to Salesforce by default.

Note

After applying the workaround, you may notice the following warning message. It is expected and does not affect configuration rendering:

Warning: Merging destination map for chart 'stacklight'. Overwriting table
item 'match', with non table value: []

Workaround:

Open the StackLight configuration manifest as described in StackLight configuration procedure.

In alertmanagerSimpleConfig.salesForce, specify the following configuration:

alertmanagerSimpleConfig:
  salesForce:
    route:
      match: []
      match_re:
        severity: "informational|critical"
      matchers:
      - severity=~"informational|critical"

Storage¶

[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13¶

Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1

Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and may cause data corruption. Therefore, manageOsds is disabled until further notice.

As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster, perform the steps described in Remove Ceph OSD manually.

Bootstrap¶

[16873] Bootstrap fails with ‘failed to establish connection with tiller’ error¶

Fixed in 2.12.0

If the latest Ubuntu 18.04 image, for example, with kernel 4.15.0-153-generic, is installed on the bootstrap node, a management cluster bootstrap fails during the setup of the Kubernetes cluster by kind.

The issue occurs because kind version 0.9.0 delivered with the bootstrap script is not compatible with the latest Ubuntu 18.04 image, which requires kind version 0.11.1.

To verify that the bootstrap node is affected by the issue:

In the bootstrap script stdout, verify the connection to Tiller.

Example of system response extract on an affected bootstrap node:

clusterdeployer.go:164] Initialize Tiller in bootstrap cluster.
bootstrap_create.go:64] unable to initialize Tiller in bootstrap cluster: \
failed to establish connection with tiller

In the bootstrap script stdout, identify the step after which the bootstrap process fails.

Example of system response extract on an affected bootstrap node:
```
clusterdeployer.go:128] Connecting to bootstrap cluster
```

In the kind cluster, verify the kube-proxy service readiness:

./bin/kind get kubeconfig --name clusterapi > /tmp/kind_kubeconfig.yaml

./bin/kubectl --kubeconfig /tmp/kind_kubeconfig.yaml get po -n kube-system | grep kube-proxy

./bin/kubectl --kubeconfig /tmp/kind_kubeconfig.yaml -n kube-system logs kube-proxy-<podPostfixID>

Example of the kube-proxy service stdout extract on an affected bootstrap node:

I0831 11:56:16.139300  1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
F0831 11:56:16.139313  1 server.go:497] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

If the verification steps above are positive, proceed with the workaround below.

Workaround:

Clean up the bootstrap cluster:

./bin/kind delete cluster --name clusterapi

Upgrade the kind binary to version 0.11.1:

curl -L https://github.com/kubernetes-sigs/kind/releases/download/v0.11.1/kind-linux-amd64 -o bin/kind

chmod a+x bin/kind

Restart the bootstrap script:
```
./bootstrap.sh all
```
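Before restarting the bootstrap script, the kind binary version can be compared against 0.11.1. The following is a sketch: both helper names are hypothetical, the parsing assumes that kind version prints a line such as "kind v0.11.1 go1.16.4 linux/amd64", and the comparison relies on the sort -V option of GNU coreutils.

```shell
# Extract the bare version number from `kind version` output on stdin.
# Assumes the layout "kind v<version> <go version> <platform>".
kind_version_from() {
  awk '{ sub(/^v/, "", $2); print $2; exit }'
}

# Succeed when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here):
#   current="$(./bin/kind version | kind_version_from)"
#   version_at_least "$current" 0.11.1 || echo "kind $current is too old"
```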

Upgrade¶

[17477] StackLight in HA mode is not deployed or cluster update is blocked¶

Fixed in 2.12.0

New managed clusters deployed using the Cluster release 6.18.0 with StackLight enabled in the HA mode on control plane nodes do not have StackLight deployed. The update of existing clusters with such StackLight configuration that were created using the Cluster release 6.16.0 is blocked with the following error message:

cluster release version upgrade is forbidden: \
Minimum number of worker machines with StackLight label is 3

Workaround:

On the affected managed cluster:
1. Create a key-value pair that will be used as a unique label on the cluster nodes. In our example, it is forcedRole: stacklight.
  
  To verify the label names that already exist on the cluster nodes:
```
kubectl get nodes --show-labels
```
2. Add the new label to the target nodes for StackLight. For example, to the Kubernetes master nodes:
```
kubectl label nodes --selector=node-role.kubernetes.io/master forcedRole=stacklight
```
3. Verify that the new label is added:
```
kubectl get nodes --show-labels
```

On the related management cluster:

Configure nodeSelector for the StackLight components by modifying the affected Cluster object:

kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>

For example:

spec:
  ...
  providerSpec:
    ...
    value:
      ...
      helmReleases:
        ...
        - name: stacklight
          values:
            ...
            nodeSelector:
              default:
                forcedRole: stacklight

Select from the following options:
- If you faced the issue during a managed cluster deployment, skip this step.
- If you faced the issue during a managed cluster update, wait until all StackLight components resources are recreated on the target nodes with updated node selectors.
  
  To monitor the cluster status:
```
kubectl get cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName> -o jsonpath='{.status.providerStatus.conditions[?(@.type=="StackLight")]}' | jq
```
  In the cluster status, verify that the elasticsearch-master and prometheus-server resources are ready. The process can take up to 30 minutes.
  
  Example of a negative system response:
```
{
  "message": "not ready: statefulSets: stacklight/elasticsearch-master got 2/3 replicas",
  "ready": false,
  "type": "StackLight"
}
```

In the Container Cloud web UI, add a fake StackLight label to any 3 worker nodes to satisfy the deployment requirement as described in Create a machine using web UI. StackLight will still be placed on the target nodes with the forcedRole: stacklight label.

Once done, the StackLight deployment or update proceeds.
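The label verification on the managed cluster can be turned into a count instead of a visual check. The following is a sketch: count_labeled_nodes is a hypothetical helper that assumes the comma-separated LABELS column is the last field of the kubectl get nodes --show-labels output.

```shell
# Count the nodes that carry an exact key=value label.
# Reads `kubectl get nodes --show-labels` output on stdin and assumes the
# comma-separated LABELS column is the last field of each line.
count_labeled_nodes() {
  label="$1"
  awk -v label="$label" '
    {
      n = split($NF, labels, ",")
      for (i = 1; i <= n; i++) {
        if (labels[i] == label) { count++; break }
      }
    }
    END { print count + 0 }'
}

# Example (not run here):
#   kubectl get nodes --show-labels | count_labeled_nodes forcedRole=stacklight
```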

[17412] Cluster upgrade fails on the KaaSCephCluster CRD update¶

An upgrade of a bare metal or Equinix Metal based management cluster originally deployed using a Container Cloud release earlier than 2.8.0 fails with the following error message:

Upgrade "kaas-public-api" failed: \
cannot patch "kaascephclusters.kaas.mirantis.com" with kind \
CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io \
"kaascephclusters.kaas.mirantis.com" is invalid: \
spec.preserveUnknownFields: Invalid value: true: \
must be false in order to use defaults in the schema

Workaround:

Change the preserveUnknownFields value for the KaaSCephCluster CRD to false:

kubectl patch crd kaascephclusters.kaas.mirantis.com -p '{"spec":{"preserveUnknownFields":false}}'

Upgrade kaas-public-api:

helm -n kaas upgrade kaas-public-api https://binary.mirantis.com/core/helm/kaas-public-api-1.24.6.tgz --reuse-values

[17069] Cluster upgrade fails with the ‘Failed to configure Ceph cluster’ error¶

Fixed in 2.12.0

An upgrade of a bare metal or Equinix Metal based management or managed cluster fails with the following exemplary error messages:

- message: 'Failed to configure Ceph cluster: ceph cluster verification is failed:
  [PG_AVAILABILITY: Reduced data availability: 33 pgs inactive, OSD_DOWN: 3 osds
  down, OSD_HOST_DOWN: 3 hosts (3 osds) down, OSD_ROOT_DOWN: 1 root (3 osds) down,
  Not all Osds are up]'

- message: 'not ready: deployments: kaas/dnsmasq got 0/1 replicas, kaas/ironic got
    0/1 replicas, rook-ceph/rook-ceph-osd-0 got 0/1 replicas, rook-ceph/rook-ceph-osd-1
    got 0/1 replicas, rook-ceph/rook-ceph-osd-2 got 0/1 replicas; statefulSets: kaas/httpd
    got 0/1 replicas, kaas/mariadb got 0/1 replicas'
  ready: false
  type: Kubernetes

The cluster is affected by the issue if it has different Ceph versions installed:

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name) -- ceph versions

Example of system response:

"mon": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 3
},
"mgr": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1
},
"osd": {
    "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
},
"mds": {},
"overall": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 4,
    "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
}

Additionally, the output may display no Ceph OSDs:

  "mon": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 3
  },
  "mgr": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 1
  },
  "osd": {},
  "mds": {},
  "overall": {
    "ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)": 4
  }

Workaround:

Manually update the image of each rook-ceph-osd deployment to mirantis.azurecr.io/ceph/ceph:v15.2.13:
```
kubectl -n rook-ceph edit deploy rook-ceph-osd-<i>
```
In the editor that opens, search for 14.2.19 and replace it with 15.2.13.

Verify that all OSDs for all rook-ceph-osd deployments have the 15.2.13 image version:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.containers[0].image}{"\n"}{end}'

Restart the rook-ceph-operator pod:

kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
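The verification step can be narrowed down to the pods that still need attention. The following is a sketch: outdated_osd_pods is a hypothetical helper that reads the "name image" pairs printed by the jsonpath command above and lists the pods whose image does not end with the expected tag.

```shell
# Print the names of pods whose image does not end with the expected tag.
# Reads "<pod name> <image>" lines on stdin.
outdated_osd_pods() {
  want="$1"
  awk -v want="$want" '
    NF >= 2 {
      n = length($2)
      if (substr($2, n - length(want) + 1) != want) print $1
    }'
}

# Example (not run here): pipe the output of the verification command above
# into the helper; empty output means that all OSD pods use the new image.
#   ... | outdated_osd_pods ":v15.2.13"
```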

[17007] False-positive ‘release: “squid-proxy” not found’ error¶

Fixed in 2.12.0

During a management cluster upgrade of any supported cloud provider except vSphere, you may notice the following false-positive messages for the squid-proxy Helm release that is disabled in Container Cloud 2.11.0:

Helm charts not installed yet: squid-proxy

Error: release: "squid-proxy" not found

Ignore these errors for any cloud provider except vSphere, which continues using squid-proxy in Container Cloud 2.11.0.

[16964] Management cluster upgrade gets stuck¶

Fixed in 2.12.0

A management cluster upgrade may get stuck and then fail with the following error message: "ClusterWorkloadLocks in cluster default/kaas-mgmt are still active - ceph-clusterworkloadlock".

To verify that the cluster is affected:

Enter the ceph-tools pod.
Verify that some Ceph daemons were not upgraded to Octopus:
```
ceph versions
```

Run ceph -s and verify that the output contains the following health warning:

mons are allowing insecure global_id reclaim
clients are allowing insecure global_id reclaim

If the upgrade is stuck, some Ceph daemons are stuck on upgrade to Octopus, and the health warning above is present, perform the following steps.

Workaround:

Run the following commands:

ceph config set global mon_warn_on_insecure_global_id_reclaim false
ceph config set global mon_warn_on_insecure_global_id_reclaim_allowed false

Exit the ceph-tools pod.

Restart the rook-ceph-operator pod:

kubectl -n rook-ceph delete pod -l app=rook-ceph-operator

[16777] Cluster update fails due to Patroni being not ready¶

Fixed in 2.12.0

An update of the Container Cloud management, regional, or managed cluster of any cloud provider type from the Cluster release 7.0.0 to 7.1.0 fails due to the failed Patroni pod.

As a workaround, increase the default resource requests and limits for PostgreSQL as follows:

resources:
  postgresql:
    requests:
      cpu: "256m"
      memory: "1Gi"
    limits:
      cpu: "512m"
      memory: "2Gi"

For details, see MOSK Operations Guide: StackLight configuration parameters - Resource limits.

[16379,23865] Cluster update fails with the FailedMount warning¶

Fixed in 2.19.0

An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.

Workaround:

Verify that the description of the pods that failed to run contains the FailedMount events:
```
kubectl -n <affectedProjectName> describe pod <affectedPodName>
```
- <affectedProjectName> is the Container Cloud project name where the pods failed to run
- <affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
1. Identify csiPodName of the corresponding csi-rbdplugin:
```
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
```
2. Output the affected csiPodName logs:
```
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
```
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.

On every csi-rbdplugin pod, search for stuck csi-vol:

for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done

Unmap the affected csi-vol:
```
rbd unmap -o force /dev/rbd<i>
```
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete volumeattachment of the affected pod:

kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
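The /dev/rbd<i> value for the unmap step can be picked out of the rbd device list output. The following is a sketch: rbd_device_for is a hypothetical helper that assumes the device path is the last field of each row.

```shell
# Print the device path of the mapped RBD image that matches a volume name.
# Reads `rbd device list` output on stdin and assumes the device path is
# the last field of each row.
rbd_device_for() {
  vol="$1"
  awk -v vol="$vol" 'index($0, vol) > 0 { print $NF }'
}

# Example (not run here), inside the loop over the csi-rbdplugin pods:
#   kubectl exec -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | rbd_device_for <csi-vol-uuid>
```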

[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶

Fixed in 2.14.0

Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.

Workaround:

Log in to the helm-controller pod console:

kubectl exec -n kube-system -it helm-controller-0 sh -c tiller

Identify the release that is stuck in the PENDING_UPGRADE status. For example:

./helm --host=localhost:44134 history openstack-operator

Example of system response:

REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade

Roll back the failed release to the previous revision:
1. Download the Helm v3 binary. For details, see official Helm documentation.
2. Roll back the failed release:
```
helm rollback <failed-release-name>
```
  For example:
```
helm rollback openstack-operator 2
```
Once done, the release will be reconciled.
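The revision to roll back to can be derived from the history output. The following is a sketch: last_stable_revision is a hypothetical helper that assumes the first column of each row is the revision number, and it only skips revisions in a PENDING_* status, so inspect the history manually if it contains other unhealthy revisions.

```shell
# Print the newest revision that is not in a PENDING_* status.
# Reads `helm history <release>` output on stdin and assumes a header line
# followed by rows whose first column is the revision number.
last_stable_revision() {
  awk 'NR > 1 && $0 !~ /PENDING_/ { rev = $1 } END { if (rev != "") print rev }'
}

# Example (not run here), from inside the tiller container:
#   ./helm --host=localhost:44134 history openstack-operator | last_stable_revision
```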

[18076] StackLight update failure¶

Fixed in 2.13.0

On a managed cluster with logging disabled, changing NodeSelector can cause StackLight update failure with the following message in the StackLight Helm Controller logs:

Upgrade "stacklight" failed: Job.batch "stacklight-delete-logging-pvcs-*" is invalid: spec.template: Invalid value: ...

As a workaround, disable the stacklight-delete-logging-pvcs-* job.

Workaround:

Open the affected Cluster object for editing:

kubectl edit cluster <affectedManagedClusterName> -n <affectedManagedClusterProjectName>

Set deleteVolumes to false:

spec:
  ...
  providerSpec:
    ...
    value:
      ...
      helmReleases:
        ...
        - name: stacklight
          values:
            ...
            logging:
              deleteVolumes: false
            ...

Container Cloud web UI¶

[249] A newly created project does not display in the Container Cloud web UI¶

Affects only Container Cloud 2.18.0 and earlier

A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token lacks the necessary role for the new project. As a workaround, log in to the Container Cloud web UI again.