Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.10.0, including the Cluster releases 7.0.0, 6.16.0, and 5.16.0.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state with the pod has unbound immediate PersistentVolumeClaims and node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts of the affected pods:
kubectl get pod/<pod_name1> pod/<pod_name2> \
-o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
Note
In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.
Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:
kubectl -n stacklight delete \
pod/<pod_name1> pod/<pod_name2> ... pvc/<pvc_name1> pvc/<pvc_name2> ...
Equinix Metal¶
[16718] Equinix Metal provider fails to create machines with SSH keys error¶
Fixed in 2.12.0
If an Equinix Metal-based cluster is deployed in an Equinix Metal project that has no SSH keys, the Equinix Metal provider fails to create machines with the following error:
Failed to create machine "kaas-mgmt-controlplane-0"...
failed to create device: POST https://api.equinix.com/metal/v1/projects/...
<deviceID> must have at least one SSH key or explicitly send no_ssh_keys option
Workaround:
Create a new SSH key. If you need to generate one, see the sketch after this procedure.
Log in to the Equinix Metal console.
In Project Settings, click Project SSH Keys.
Click Add New Key and add details of the newly created SSH key.
Click Add.
Restart the cluster deployment.
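If you do not have an SSH key to add in the first step, you can generate one locally before adding it to the project. A minimal sketch, assuming OpenSSH is available on the workstation and using a hypothetical file path:
# Generate a new Ed25519 key pair; the public part is what you add in Project SSH Keys
ssh-keygen -t ed25519 -f ~/.ssh/equinix_mcc -C "mcc-equinix"
# Print the public key to copy it into the Add New Key form
cat ~/.ssh/equinix_mcc.pub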
Bare metal¶
[17118] Failure to add a new machine to cluster¶
Fixed in 2.12.0
Adding a new machine to a baremetal-based
managed cluster may fail after the baremetal-based management cluster upgrade.
The issue occurs because the PXE boot is not working for the new node.
In this case, the file /volume/tftpboot/ipxe.efi not found logs appear in the dnsmasq-tftp container.
Workaround:
Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.
Scale the Ironic deployment down to 0 replicas:
kubectl -n kaas scale deployments/ironic --replicas=0
Scale the Ironic deployment up to 1 replica:
kubectl -n kaas scale deployments/ironic --replicas=1
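After scaling Ironic back up, you can confirm that the deployment is ready again before retrying the machine addition. A minimal check, assuming the kaas namespace used above:
# The READY column should show 1/1 for the ironic deployment
kubectl -n kaas get deployments/ironic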
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
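A minimal sketch of such an inspection, assuming the IpamHost object resides in the cluster project namespace and exposes these fields in its manifest; exact field paths may differ between releases:
# List the IpamHost objects in the cluster project namespace
kubectl -n <clusterProjectName> get ipamhosts
# Show the rendering and allocation results of the affected host
kubectl -n <clusterProjectName> get ipamhost <ipamHostName> -o yaml | grep -A 3 -E 'l2RenderResult|ipAllocationResult'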
OpenStack¶
[10424] Regional cluster cleanup fails by timeout¶
An OpenStack-based regional cluster cleanup fails with a timeout error.
Workaround:
Wait for the Cluster object to be deleted in the bootstrap cluster:
kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
The system output must be empty.
Remove the bootstrap cluster manually:
./bin/kind delete cluster --name clusterapi
vSphere¶
[15698] VIP is assigned to each manager node instead of a single node¶
Fixed in 2.11.0
A load balancer virtual IP address (VIP) is assigned to each manager
node on any type of the vSphere-based cluster. The issue occurs because
the Keepalived instances cannot set up a cluster due to the blocked vrrp protocol traffic in the firewall configuration on the Container Cloud nodes.
Note
Before applying the workaround below, verify that the dedicated vSphere network does not have any other virtual machines with the keepalived instance running with the same vrouter_id.
You can verify the vrouter_id value of the cluster in /etc/keepalived/keepalived.conf on the manager nodes.
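For example, a quick way to check the value on a manager node; this assumes the default configuration path and that the value is rendered as the standard virtual_router_id directive in the Keepalived configuration:
# Show the virtual router ID configured for Keepalived on this node
grep -i virtual_router_id /etc/keepalived/keepalived.conf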
Workaround
Update the firewalld configuration on each manager node of the affected cluster to allow the vrrp protocol traffic between the nodes:
SSH to any manager node using mcc-user.
Apply the firewalld configuration:
firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
firewall-cmd --reload
Apply the procedure to the remaining manager nodes of the cluster.
[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware vSphere provider. The following is an example output of the kubectl describe pod <pod-name> -n <projectName> command:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297:
applying cgroup configuration for process caused
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>>:
cannot allocate memory": unknown
The issue occurs due to known Kubernetes and Docker community issues.
According to the Red Hat solution, the workaround is to disable the kernel memory accounting feature by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only. The issue is resolved for new Container Cloud 2.9.0 deployments since the workaround below automatically applies to the VM template built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for GRUB_CMDLINE_LINUX. See the example after this procedure.
Update kernel:
yum install kernel kernel-headers kernel-tools kernel-tools-libs kexec-tools
Update the grub configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot the machine.
Wait for the machine to become available.
Wait for 5 minutes for Docker and Kubernetes services to start.
Verify that the machine is Ready:
docker node ls
kubectl get nodes
Repeat the steps above on the remaining machines of the affected cluster.
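The /etc/default/grub change mentioned above can look as follows. This is an illustrative fragment only: the existing parameters in GRUB_CMDLINE_LINUX depend on the VM template and must be preserved, with cgroup.memory=nokmem appended to them:
# /etc/default/grub (fragment)
GRUB_CMDLINE_LINUX="<existing parameters> cgroup.memory=nokmem"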
[14080] Node leaves the cluster after IP address change¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP addresses assignment to the vSphere virtual machines is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient
prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient
from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after the IP of the VM has been changed. Therefore, such issue may lead to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the following example error:
docker logs -f ucp-swarm-manager
level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>: 12376"
If the affected node is a manager:
The output of the docker info command contains the following example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.
The following lines are present in /var/log/messages:
dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address, as shown in the sketch after this list.
If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.
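A minimal sketch of a static configuration for eth0 on a RHEL-based VM, assuming the legacy network-scripts are in use; the addresses are placeholders and must match your vSphere network:
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative values)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none    # disable DHCP on this interface
IPADDR=<node IP>
PREFIX=24
GATEWAY=<gateway IP>
DNS1=<dns IP>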
LCM¶
[16146] Stuck kubelet on the Cluster release 5.x.x series¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
Occasionally, kubelet may get stuck on the Cluster release 5.x.x series with different errors in the ucp-kubelet containers, leading to node failures. The following error occurs every time when accessing the Kubernetes API server:
an error on the server ("") has prevented the request from succeeding
As a workaround, restart ucp-kubelet on the failed node:
ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet
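After the restart, you can verify that the ucp-kubelet container was recreated and that the node reports Ready again; a minimal check on the affected node:
# Confirm that the ucp-kubelet container is running again
docker ps --filter name=ucp-kubelet
# Confirm that the node is Ready
kubectl get nodes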
[8367] Adding of a new manager node to a managed cluster hangs on Deploy stage¶
Fixed in 2.12.0
Adding a new manager node to a managed cluster may hang due to issues with joining etcd from a new node to the existing etcd cluster. The new manager node hangs in the Deploy stage.
Symptoms:
The Ansible run tries executing the Wait for Docker UCP to be accessible step and fails with the following error message:
Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
The etcd logs on the leader etcd node contain the following example error message occurring every 1-2 minutes:
2021-06-10 03:21:53.196677 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:22bb1d4275f1c5b0 RaftAttributes:{PeerURLs:[https://<new manager IP>:12380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
To determine the etcd leader, run on any manager node:
docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl -w table --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint status
To verify logs on the leader node:
docker logs ucp-kv
Root cause:
In case of an unfortunate network partition, the leader may lose quorum and the members are not able to perform a new election. For more details, see Official etcd documentation: Learning, figure 5.
Workaround:
Restart etcd on the leader node:
docker rm -f ucp-kv
Wait several minutes until the etcd cluster starts and reconciles.
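To verify that the etcd cluster has recovered, you can reuse the same ucp-kv container to check the endpoint health; a minimal check using the same endpoints as above:
docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint health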
The deployment of the new manager node will proceed and it will join the etcd cluster. After that, other MKE components will be configured and the node deployment will be finished successfully.
[13303] Managed cluster update fails with the Network is unreachable error¶
Fixed in 2.11
A managed cluster update from the Cluster release 6.12.0 to 6.14.0 fails with worker nodes being stuck in the Deploy state with the Network is unreachable error.
Workaround:
Verify the state of the loopback network interface:
ip l show lo
If the interface is not in the UNKNOWN or UP state, enable it manually:
ip l set lo up
If the interface is in the UNKNOWN or UP state, assess the cluster logs to identify the failure root cause.
Repeat the cluster update procedure.
[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0, the LCM Agent upgrade fails with the following error in logs:
lcmAgentUpgradeStatus:
error: 'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent:
x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier are affected.
As a workaround, restart lcm-agent using the service lcm-agent-* restart command on the affected nodes.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:
kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight:
  attempt: 2
  chart: ""
  finishedAt: "2021-02-05T09:41:05Z"
  hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
  message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]'
  notes: ""
  status: UNKNOWN
  success: false
  version: 0.1.2-mcp-398
Log in to the helm-controller pod console:
kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
Download the Helm v3 binary, as shown in the sketch after this procedure. For details, see official Helm documentation.
Remove the failed release:
helm delete <failed-release-name>
For example:
helm delete stacklight
Once done, the release triggers for redeployment.
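A minimal sketch of downloading the Helm v3 binary inside the pod, assuming curl and tar are available there; the version below is only an example, any recent Helm v3 release works:
# Download and unpack a Helm v3 release (adjust the version as needed)
curl -fsSL https://get.helm.sh/helm-v3.5.4-linux-amd64.tar.gz -o /tmp/helm.tar.gz
tar -xzf /tmp/helm.tar.gz -C /tmp
/tmp/linux-amd64/helm version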
IAM¶
[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message, proceed to the next step. Otherwise, assess the MariaDB pod logs to identify the failure root cause.
Obtain the MariaDB admin password:
kubectl get secret -n kaas mariadb-dbadmin-password -o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d ; echo
Verify that wsrep_local_state_comment is Donor or Desynced:
kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -uroot -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
Restart the failed pod:
kubectl delete pod -n kaas <failedMariadbPodName>
StackLight¶
[16843] Inability to override default route matchers for Salesforce notifier¶
Fixed in 2.12.0
It may be impossible to override the default route matchers for Salesforce notifier.
Note
After applying the workaround, you may notice the following warning message. It is expected and does not affect configuration rendering:
Warning: Merging destination map for chart 'stacklight'. Overwriting table
item 'match', with non table value: []
Workaround:
Open the StackLight configuration manifest as described in StackLight configuration procedure.
In alertmanagerSimpleConfig.salesForce, specify the following configuration:
alertmanagerSimpleConfig:
  salesForce:
    route:
      match: []
      match_re:
        your_matcher_key1: your_matcher_value1
        your_matcher_key2: your_matcher_value2
        ...
[17771] Watchdog alert missing in Salesforce route¶
Fixed in 2.13.0
The Watchdog alert is not routed to Salesforce by default.
Note
After applying the workaround, you may notice the following warning message. It is expected and does not affect configuration rendering:
Warning: Merging destination map for chart 'stacklight'. Overwriting table
item 'match', with non table value: []
Workaround:
Open the StackLight configuration manifest as described in StackLight configuration procedure.
In alertmanagerSimpleConfig.salesForce, specify the following configuration:
alertmanagerSimpleConfig:
  salesForce:
    route:
      match: []
      match_re:
        severity: "informational|critical"
      matchers:
        - severity=~"informational|critical"
Storage¶
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Delete the authorization key for the failed Ceph OSD:
ceph auth del osd.<ID>
SSH to the node on which the Ceph OSD cannot be created.
Clean up the disk that will be a base for the failed Ceph OSD, as shown in the sketch after this procedure. For details, see official Rook documentation.
Note
Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.
On the managed cluster, restart Rook Operator:
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
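A minimal sketch of the disk cleanup mentioned above, assuming the replaced disk is the hypothetical /dev/sdX; it wipes the partition tables and discards the device contents, and, as noted above, failures of these commands can be ignored:
DISK=/dev/sdX             # the disk that will back the recreated Ceph OSD
sgdisk --zap-all $DISK    # wipe GPT and MBR data structures
blkdiscard $DISK          # discard the device contents; may fail on some devices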
Bootstrap¶
[16873] Bootstrap fails with ‘failed to establish connection with tiller’ error¶
Fixed in 2.12.0
If the latest Ubuntu 18.04 image, for example, with kernel 4.15.0-153-generic, is installed on the bootstrap node, a management cluster bootstrap fails during the setup of the Kubernetes cluster by kind.
The issue occurs because the kind version 0.9.0 delivered with the bootstrap script is not compatible with the latest Ubuntu 18.04 image, which requires kind version 0.11.1.
To verify that the bootstrap node is affected by the issue:
In the bootstrap script stdout, verify the connection to Tiller.
Example of system response extract on an affected bootstrap node:
clusterdeployer.go:164] Initialize Tiller in bootstrap cluster.
bootstrap_create.go:64] unable to initialize Tiller in bootstrap cluster: \
failed to establish connection with tiller
In the bootstrap script stdout, identify the step after which the bootstrap process fails.
Example of system response extract on an affected bootstrap node:
clusterdeployer.go:128] Connecting to bootstrap cluster
In the kind cluster, verify the kube-proxy service readiness:
./bin/kind get kubeconfig --name clusterapi > /tmp/kind_kubeconfig.yaml
./bin/kubectl --kubeconfig /tmp/kind_kubeconfig.yaml get po -n kube-system | grep kube-proxy
./bin/kubectl --kubeconfig /tmp/kind_kubeconfig.yaml -n kube-system logs kube-proxy-<podPostfixID>
Example of the kube-proxy service stdout extract on an affected bootstrap node:
I0831 11:56:16.139300       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
F0831 11:56:16.139313       1 server.go:497] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied
If the verification steps above are positive, proceed with the workaround below.
Workaround:
Clean up the bootstrap cluster:
./bin/kind delete cluster --name clusterapi
Upgrade the kind binary to version 0.11.1:
curl -L https://github.com/kubernetes-sigs/kind/releases/download/v0.11.1/kind-linux-amd64 -o bin/kind
chmod a+x bin/kind
Restart the bootstrap script:
./bootstrap.sh all
Upgrade¶
[16233] Bare metal pods fail during upgrade due to Ceph not unmounting RBD¶
Fixed in 2.11.0
A baremetal-based management cluster upgrade can fail with stuck ironic and dnsmasq pods. The issue may occur because the Ceph persistent volumes provisioned before the upgrade are unmapped incorrectly. As a result, RBD volume mounts remain on the nodes without any real RBD volumes behind them.
Symptoms:
The ironic and dnsmasq deployments fail:
kubectl -n kaas get deploy
Example of system response:
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
ironic    0/1     0            0           6d10h
dnsmasq   0/1     0            0           6d10h
The bare metal mariadb and httpd statefulSets fail:
kubectl -n kaas get statefulset
Example output:
NAME      READY   AGE
httpd     0/1     6d10h
mariadb   0/1     6d10h
On the pods of the failed deployments, the ll /volume command hangs or outputs the input/output error:
Enter any pod of the failed deployment:
kubectl -n kaas exec -it <podName> -- bash
Replace <podName> with the affected pod name. For example, httpd-0.
Obtain the list of files in the /volume directory:
ll /volume
Example of system response:
ls: reading directory '.': Input/output error
If the above command gets stuck or outputs the Input/output error error, the issue relates to the ceph-csi unmounted RBD devices.
Workaround:
Identify the names of nodes with the affected pods:
kubectl -n kaas get pod <podName> -o jsonpath='{.spec.nodeName}'
Replace <podName> with the affected pod name.
Identify which csi-rbdplugin pod is assigned to which node:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}'
Enter any affected csi-rbdplugin pod:
kubectl -n rook-ceph exec -it <csiPodName> -c csi-rbdplugin -- bash
Identify the mapped RBD devices on this pod:
rbd device list
Identify which devices are mounted on this pod:
mount | grep rbd
Unmount all devices that are not included into the rbd device list command output:
umount <rbdDeviceName>
Replace <rbdDeviceName> with a mounted RBD device name that is not included into the rbd device list output. For example, /dev/rbd0.
Exit the csi-rbdplugin pod:
exit
Repeat the steps above for the remaining affected csi-rbdplugin pods on every affected node.
Once all nonexistent mounts are unmounted on all nodes, restart the stuck deployments:
kubectl -n kaas get deploy
kubectl -n kaas scale deploy <deploymentName> --replicas 0
kubectl -n kaas scale deploy <deploymentName> --replicas <replicasNumber>
<deploymentName> is a stuck bare metal deployment name, for example, ironic
<replicasNumber> is the original number of replicas for the deployment that you can obtain using the get deploy command
Restart the failed bare metal statefulSets:
kubectl -n kaas get statefulset
kubectl -n kaas scale statefulset <statefulSetName> --replicas 0
kubectl -n kaas scale statefulset <statefulSetName> --replicas <replicasNumber>
<statefulSetName> is a failed bare metal statefulSet name, for example, mariadb
<replicasNumber> is the original number of replicas for the statefulSet that you can obtain using the get statefulset command
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the descriptions of the pods that failed to run contain the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
<affectedProjectName> is the Container Cloud project name where the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.
On every csi-rbdplugin pod, search for stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.
Workaround:
Log in to the helm-controller pod console:
kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
Identify the release that got stuck in the PENDING_UPGRADE status. For example:
./helm --host=localhost:44134 history openstack-operator
Example of system response:
REVISION  UPDATED                   STATUS           CHART                       DESCRIPTION
1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9    Install complete
2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9    Upgrade complete
3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18   Preparing upgrade
Roll back the failed release to the previous revision:
Download the Helm v3 binary. For details, see official Helm documentation.
Roll back the failed release:
helm rollback <failed-release-name> <revision>
For example:
helm rollback openstack-operator 2
Once done, the release will be reconciled.
[15766] Cluster upgrade failure¶
Fixed in 2.11.0
Upgrade of a Container Cloud management or regional cluster from version 2.9.0
to 2.10.0 and managed cluster from 5.16.0 to 5.17.0 may fail with the following
error message for the patroni-12-0, patroni-12-1, or patroni-12-2 pod:
error when evicting pods/"patroni-12-2" -n "stacklight" (will retry after 5s):
Cannot evict pod as it would violate the pod's disruption budget.
As a workaround, reinitialize the Patroni pod that got stuck:
kubectl -n stacklight exec -ti -c patroni $(kubectl -n stacklight \
get ep/patroni-12 -o jsonpath='{.metadata.annotations.leader}') -- \
patronictl reinit patroni-12 <POD_NAME> --force --wait
Substitute <POD_NAME>
with the name of the Patroni pod from the error
message. For example:
kubectl -n stacklight exec -ti -c patroni $(kubectl -n stacklight \
get ep/patroni-12 -o jsonpath='{.metadata.annotations.leader}') -- \
patronictl reinit patroni-12 patroni-12-2
If the command above fails, reinitialize the affected pod with a new volume by
deleting the pod itself and the associated PersistentVolumeClaim
(PVC):
Obtain the PVC of the affected pod:
kubectl -n stacklight get "pod/<POD_NAME>" -o jsonpath='{.spec.volumes[?(@.name=="storage-volume")].persistentVolumeClaim.claimName}'
Delete the affected pod and its PVC:
kubectl -n stacklight delete "pod/<POD_NAME>" "pvc/<POD_PVC>"
sleep 3  # wait for the StatefulSet to reschedule the pod, but miss the dependent PVC creation
kubectl -n stacklight delete "pod/<POD_NAME>"
[16141] Alertmanager pod gets stuck in CrashLoopBackOff during upgrade¶
Fixed in 2.11.0
An Alertmanager pod may get stuck in the CrashLoopBackOff
state during
upgrade of a management, regional, or managed cluster and thus cause upgrade
failure with the Loading configuration file failed error message in logs.
Workaround:
Delete the Alertmanager pod that is stuck in the CrashLoopBackOff state. For example:
kubectl delete pod/prometheus-alertmanager-1 -n stacklight
Wait for several minutes and verify that Alertmanager and its pods are up and running:
kubectl get all -n stacklight -l app=prometheus,component=alertmanager
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log back in.