Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.7.0 including the Cluster releases 5.14.0 and 6.14.0.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state with the pod has unbound immediate PersistentVolumeClaims and node(s) had volume node affinity conflict errors.
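To identify the affected pods, you can, for example, list the pods stuck in the Pending state and inspect their events (a generic kubectl sketch, not part of the official workaround):
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl describe pod <pod_name> -n <namespace>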
Warning
The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts of the affected pods:
kubectl get pod/<pod_name1> pod/<pod_name2> \
  -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
Note
In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.
Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:
kubectl -n stacklight delete \
  pod/<pod_name1> pod/<pod_name2> ... pvc/<pvc_name1> pvc/<pvc_name2> ...
vSphere¶
[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297:
applying cgroup configuration for process caused
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>>:
cannot allocate memory": unknown
The issue occurs due to known Kubernetes and Docker community issues.
According to the Red Hat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem
to the kernel command line.
Note
The workaround below applies to the existing clusters only. The issue is resolved for new Container Cloud 2.9.0 deployments since the workaround below automatically applies to the VM template built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for GRUB_CMDLINE_LINUX.
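For example, the resulting entry in /etc/default/grub may look as follows (the other kernel parameters are placeholders; keep the values already present on your machine):
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet cgroup.memory=nokmem"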
Update kernel:
yum install kernel kernel-headers kernel-tools kernel-tools-libs kexec-tools
Update the grub configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg
Reboot the machine.
Wait for the machine to become available.
Wait for 5 minutes for Docker and Kubernetes services to start.
Verify that the machine is Ready:
docker node ls
kubectl get nodes
Repeat the steps above on the remaining machines of the affected cluster.
OpenStack¶
[10424] Regional cluster cleanup fails by timeout¶
An OpenStack-based regional cluster cleanup fails with a timeout error.
Workaround:
Wait for the Cluster object to be deleted in the bootstrap cluster:
kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
The system output must be empty.
Remove the bootstrap cluster manually:
./bin/kind delete cluster --name clusterapi
Bare metal¶
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
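For example, assuming the IpamHost object is named after the affected host and resides in the cluster project namespace, dump the object and search the output for these fields:
kubectl -n <clusterProjectName> get ipamhost <ipamHostName> -o yaml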
Storage¶
[7073] Cannot automatically remove a Ceph node¶
When removing a worker node, it is not possible to automatically remove a Ceph node. The workaround is to manually remove the Ceph node from the Ceph cluster as described in Operations Guide: Add, remove, or reconfigure Ceph nodes before removing the worker node from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state due to the Ceph OSD authorization key failing to be created properly.
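To locate the failing pod, you can, for example, filter the Rook OSD pods by status (a sketch assuming the standard app=rook-ceph-osd label that Rook sets on OSD pods):
kubectl -n rook-ceph get pods -l app=rook-ceph-osd | grep CrashLoopBackOff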
Workaround:
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Delete the authorization key for the failed Ceph OSD:
ceph auth del osd.<ID>
SSH to the node on which the Ceph OSD cannot be created.
Clean up the disk that will be a base for the failed Ceph OSD. For details, see official Rook documentation.
Note
Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.
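A minimal cleanup sketch based on the commands mentioned in the note above, assuming /dev/sdX is the device that backed the failed Ceph OSD (verify the actual device path on the node first):
DISK="/dev/sdX"
sgdisk --zap-all $DISK
blkdiscard $DISK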
On the managed cluster, restart Rook Operator:
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
[12723] ceph_role_* labels remain after deleting a node from KaaSCephCluster¶
Fixed in 2.8.0
The ceph_role_mon and ceph_role_mgr labels that Ceph Controller assigns to a node during a Ceph cluster creation are not automatically removed after deleting a node from KaaSCephCluster.
As a workaround, manually remove the labels using the following commands (the trailing minus sign removes the label):
kubectl label node <nodeName> ceph_role_mon-
kubectl label node <nodeName> ceph_role_mgr-
IAM¶
[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message, proceed to the next step. Otherwise, assess the MariaDB pod logs to identify the failure root cause.
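For example, to review the pod logs (a generic kubectl command, not an official step of this procedure):
kubectl logs -n kaas <failedMariadbPodName>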
Obtain the MariaDB admin password:
kubectl get secret -n kaas mariadb-dbadmin-password -o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d ; echo
Verify that wsrep_local_state_comment is Donor or Desynced:
kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -uroot -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
Restart the failed pod:
kubectl delete pod -n kaas <failedMariadbPodName>
LCM¶
[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster release 6.12.0 to 6.14.0, the LCM Agent upgrade fails with the following error in the logs:
lcmAgentUpgradeStatus:
error: 'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent:
x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier are affected.
As a workaround, restart lcm-agent using the service lcm-agent-* restart command on the affected nodes.
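Because the exact unit name includes a suffix (hence the wildcard above), you can, for example, first list the matching units and then restart the one found (a sketch assuming systemd-managed nodes; the unit name is a placeholder):
systemctl list-units 'lcm-agent-*' --no-legend
service <lcm-agent-service-name> restart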
[13381] Management and regional clusters with enabled proxy are unreachable¶
Fixed in 2.8.0
After bootstrap, requests to apiserver fail on the management and regional clusters with enabled proxy.
As a workaround, before running bootstrap.sh, add the entire range of IP addresses that will be used for floating IPs to the NO_PROXY environment variable.
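For example, assuming the floating IP range is 172.16.10.0/24 (a hypothetical value, substitute your actual range) and that CIDR notation is accepted by the components that consume NO_PROXY:
export NO_PROXY="${NO_PROXY},172.16.10.0/24"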
[13402] Cluster fails with error: no space left on device¶
Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters
If an application running on a Container Cloud management or managed cluster fails frequently, for example, PostgreSQL, it may produce an excessive amount of core dumps. This leads to the no space left on device error on the cluster nodes and, as a result, to the broken Docker Swarm and the entire cluster.
Core dumps are disabled by default on the operating system of the Container Cloud nodes. But since Docker does not inherit the operating system settings, disable core dumps in Docker using the workaround below.
Warning
The workaround below does not apply to the baremetal-based clusters, including MOS deployments, since Docker restart may destroy the Ceph cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user and the SSH key provided during the cluster creation.
In /etc/docker/daemon.json, add the following parameters:
{
  ...
  "default-ulimits": {
    "core": {
      "Hard": 0,
      "Name": "core",
      "Soft": 0
    }
  }
}
Restart the Docker daemon:
systemctl restart docker
Repeat the steps above on each machine of the affected cluster one by one.
[8112] Nodes occasionally become Not Ready on long-running clusters¶
On long-running Container Cloud clusters, one or more nodes may occasionally
become Not Ready with different errors in the ucp-kubelet containers of failed nodes.
As a workaround, restart ucp-kubelet
on the failed node:
ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.
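For example, a quick way to check these fields from the management cluster (a sketch assuming the fields are nested under .status of the Cluster object as described above):
kubectl get cluster <clusterName> -n <clusterProjectName> -o jsonpath='{.status.providerStatus.helm.ready}'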
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:
kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight:
  attempt: 2
  chart: ""
  finishedAt: "2021-02-05T09:41:05Z"
  hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
  message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]'
  notes: ""
  status: UNKNOWN
  success: false
  version: 0.1.2-mcp-398
Log in to the helm-controller pod console:
kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
Download the Helm v3 binary. For details, see official Helm documentation.
Remove the failed release:
helm delete <failed-release-name>
For example:
helm delete stacklight
Once done, the release redeployment is triggered automatically.
Upgrade¶
[13292] Local volume provisioner pod stuck in Terminating status after upgrade¶
After upgrade of Container Cloud from 2.6.0 to 2.7.0, the local volume provisioner pod in the default project is stuck in the Terminating status, even after upgrade to 2.8.0.
This issue does not affect functioning of the management, regional, or managed clusters. The issue does not prevent the successful upgrade of the cluster.
Workaround:
Verify that the cluster is affected:
kubectl get pods -n default | grep local-volume-provisioner
If the output contains a pod with the Terminating status, the cluster is affected. Capture the affected pod name, if any.
Delete the affected pod:
kubectl -n default delete pod <LVPPodName> --force
[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.
Workaround:
Log in to the helm-controller pod console:
kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
Identify the release that got stuck in the PENDING_UPGRADE status. For example:
./helm --host=localhost:44134 history openstack-operator
Example of system response:
REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
Roll back the failed release to the previous revision:
Download the Helm v3 binary. For details, see official Helm documentation.
Roll back the failed release:
helm rollback <failed-release-name>
For example:
helm rollback openstack-operator 2
Once done, the release will be reconciled.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, log in to the Container Cloud web UI again.