Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.4.0 including the Cluster releases 5.11.0 and 6.10.0.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state with the pod has unbound immediate PersistentVolumeClaims and node(s) had volume node affinity conflict errors.
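For example, to list such pods and inspect the scheduling errors, you can run the following commands (the stacklight namespace is an assumption; adjust it to the affected workload):
kubectl -n stacklight get pods --field-selector=status.phase=Pending
kubectl -n stacklight describe pod <pod_name>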
Warning
The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts of the affected pods:
kubectl get pod/<pod_name1> pod/<pod_name2> \
  -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
Note
In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.
Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:
kubectl -n stacklight delete \
  pod/<pod_name1> pod/<pod_name2> ... pvc/<pvc_name1> pvc/<pvc_name2> ...
Bare metal¶
[9875] Full preflight fails with a timeout waiting for BareMetalHost¶
Fixed in Container Cloud 2.6.0
If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
failed to create BareMetal objects: failed to wait for objects of kinds BareMetalHost to become available: timed out waiting for the condition
As a workaround, run unset KAAS_BM_FULL_PREFLIGHT to run fast preflight instead.
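For example, a minimal sketch, assuming the bootstrap script is rerun from the same directory:
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight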
[11102] Keepalived does not detect the loss of VIP deleted by netplan¶
Fixed in Container Cloud 2.5.0
This issue may occur on baremetal-based managed clusters that are created using L2 templates when the network configuration is changed by the user or when Container Cloud is updated from version 2.3.0 to 2.4.0.
Due to a known community issue, Keepalived 1.3.9 does not detect and restore a VIP of a managed cluster node after the netplan apply command is run. The command is used to apply network configuration changes.
As a result, the Kubernetes API on the affected managed clusters becomes inaccessible.
As a workaround, log in to all nodes of the affected managed clusters and restart Keepalived using systemctl restart keepalived.
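For example, a minimal sketch that restarts Keepalived over SSH (the node IP addresses and the SSH user are placeholders for your environment):
for node in <node1_ip> <node2_ip> <node3_ip>; do
  ssh <ssh_user>@"${node}" 'sudo systemctl restart keepalived'
done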
[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed on a new disk if an old volume group with the same name already exists on the target hardware node but on a different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration, add the wipe: true parameter to the device that fails to be deployed.
For the procedure details, see Operations Guide: Create a custom host profile.
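An illustrative BareMetalHostProfile fragment (a sketch only; align the field names and structure with your existing profile):
spec:
  devices:
  - device:
      wipe: true  # wipe the disk so that the stale volume group does not block the LVM deployment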
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
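For example (a sketch; the exact location of the fields may differ between releases):
kubectl -n <cluster_project_name> get ipamhost <ipamhost_name> -o yaml | grep -A 10 -E 'l2RenderResult|ipAllocationResult'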
[8560] Manual deletion of BareMetalHost leads to its silent removal¶
Fixed in Container Cloud 2.5.0
If BareMetalHost is manually removed from a managed cluster, it is silently removed without a power-off and deprovisioning, which leads to managed cluster failures.
Workaround:
Do not manually delete a BareMetalHost that has the Provisioned status.
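To verify the current status before any manual cleanup, you can, for example, list the BareMetalHost objects in the cluster project (a sketch; the output columns may vary by release):
kubectl -n <cluster_project_name> get baremetalhosts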
Storage¶
[10060] Ceph OSD node removal fails¶
Fixed in Container Cloud 2.7.0
A Ceph node removal is not being triggered properly after updating
the KaasCephCluster
custom resource (CR). Both management and managed
clusters are affected.
Workaround:
Remove the parameters for a Ceph OSD from the KaasCephCluster CR as described in Operations Guide: Add, remove, or reconfigure Ceph nodes.
Obtain the IDs of the osd and mon services that are located on the old node:
Obtain the UID of the affected machine:
kubectl get machine <CephOSDNodeName> -n <ManagedClusterProjectName> -o jsonpath='{.metadata.annotations.kaas\.mirantis\.com\/uid}'
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Identify the IDs of the pods that run the osd and mon services:
kubectl get pods -o wide -n rook-ceph | grep <affectedMachineUID> | grep -E "mon|osd"
Example of the system response extract:
rook-ceph-mon-c-7bbc5d757d-5bpws                              1/1  Running    1  6h1m
rook-ceph-osd-2-58775d5568-5lklw                              1/1  Running    4  44h
rook-ceph-osd-prepare-705ae6c647cfdac928c63b63e2e2e647-qn4m9  0/1  Completed  0  94s
The pod IDs include the osd or mon service IDs. In the example system response above, the osd ID is 2 and the mon ID is c.
Delete the deployments of the osd and mon services obtained in the previous step:
kubectl delete deployment rook-ceph-osd(mon)-<ID> -n rook-ceph
For example:
kubectl delete deployment rook-ceph-mon-c -n rook-ceph
kubectl delete deployment rook-ceph-osd-2 -n rook-ceph
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Rebalance the Ceph OSDs:
ceph osd out osd.<ID>
Wait for the rebalance to complete.
Rebalance the Ceph data:
ceph osd purge osd.<ID>
Wait for the Ceph data to rebalance.
Remove the old node from the Ceph OSD tree:
ceph osd crush rm <NodeName>
If the removed node contained mon services, remove them:
ceph mon rm <monID>
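For example, with the osd ID 2 and the mon ID c from the system response above, the sequence could look as follows (a sketch only; depending on the Ceph release, ceph osd purge may additionally require the --yes-i-really-mean-it flag):
ceph osd out osd.2
ceph osd purge osd.2
ceph osd crush rm <NodeName>
ceph mon rm c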
[9928] Ceph rebalance during a managed cluster update¶
Fixed in Container Cloud 2.5.0
During a managed cluster update, Ceph rebalance leading to data loss may occur.
Workaround:
Before updating a managed cluster:
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it <ceph-tools-pod-name> bash
Set the noout flag:
ceph osd set noout
After updating a managed cluster:
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it <ceph-tools-pod-name> bash
Unset the noout flag:
ceph osd unset noout
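To verify whether the flag is set or cleared, you can, for example, check the OSD flags from the ceph-tools pod (a sketch; the output format varies by Ceph release):
ceph osd dump | grep flags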
[7073] Cannot automatically remove a Ceph node¶
When removing a worker node, it is not possible to automatically remove a Ceph node. The workaround is to manually remove the Ceph node from the Ceph cluster as described in Operations Guide: Add, remove, or reconfigure Ceph nodes before removing the worker node from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Delete the authorization key for the failed Ceph OSD:
ceph auth del osd.<ID>
SSH to the node on which the Ceph OSD cannot be created.
Clean up the disk that will be a base for the failed Ceph OSD. For details, see official Rook documentation.
Note
Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.
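A typical cleanup sequence adapted from the Rook documentation (a sketch only; set DISK to the replaced device and verify the steps against the Rook version in use):
DISK="/dev/sdX"  # the device backing the failed Ceph OSD
sgdisk --zap-all $DISK  # wipe the partition table
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync  # clear leftover metadata at the start of the disk
blkdiscard $DISK  # discard all blocks; applies to SSD/NVMe devices only
partprobe $DISK  # make the kernel re-read the partition table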
On the managed cluster, restart Rook Operator:
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
LCM¶
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED
or UNKNOWN
state
although the corresponding machine statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready
and
providerStatus.helm.releaseStatuses.<releaseName>.success
are false
.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:
kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight:
  attempt: 2
  chart: ""
  finishedAt: "2021-02-05T09:41:05Z"
  hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
  message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]'
  notes: ""
  status: UNKNOWN
  success: false
  version: 0.1.2-mcp-398
Log in to the helm-controller pod console:
kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
Download the Helm v3 binary. For details, see official Helm documentation.
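For example (a sketch only; the Helm version is an assumption, and the container must provide wget and tar):
wget https://get.helm.sh/helm-v3.5.2-linux-amd64.tar.gz
tar -xzf helm-v3.5.2-linux-amd64.tar.gz
./linux-amd64/helm version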
Remove the failed release:
helm delete <failed-release-name>
For example:
helm delete stacklight
Once done, the release is triggered for redeployment.
StackLight¶
[11001] Patroni pod fails to start¶
Fixed in Container Cloud 2.6.0
After the management cluster update, a Patroni pod may fail to start and remain
in the CrashLoopBackOff
status. Messages similar to the following ones may
be present in Patroni logs:
Local timeline=4 lsn=0/A000000
master_timeline=6
master: history=1 0/1ADEB48 no recovery target specified
2 0/8044500 no recovery target specified
3 0/A0000A0 no recovery target specified
4 0/A1B6CB0 no recovery target specified
5 0/A2C0C80 no recovery target specified
As a workaround, reinitialize the affected pod with a new volume by deleting
the pod itself and the associated PersistentVolumeClaim
(PVC).
Workaround:
Obtain the PVC of the affected pod:
kubectl -n stacklight get "pod/${POD_NAME}" -o jsonpath='{.spec.volumes[?(@.name=="storage-volume")].persistentVolumeClaim.claimName}'
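For example, to capture the result in the variables used by the next step (the pod name is a placeholder for the affected Patroni pod):
POD_NAME=<affected_patroni_pod_name>
POD_PVC=$(kubectl -n stacklight get "pod/${POD_NAME}" -o jsonpath='{.spec.volumes[?(@.name=="storage-volume")].persistentVolumeClaim.claimName}')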
Delete the affected pod and its PVC:
kubectl -n stacklight delete "pod/${POD_NAME}" "pvc/${POD_PVC}"
sleep 3  # wait for the StatefulSet to reschedule the pod, but miss the dependent PVC creation
kubectl -n stacklight delete "pod/${POD_NAME}"
Management cluster update¶
[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE
status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator
release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Workaround:
Log in to the helm-controller pod console:
kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
Identify the release that is stuck in the PENDING_UPGRADE status. For example:
./helm --host=localhost:44134 history openstack-operator
Example of system response:
REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
Roll back the failed release to the previous revision:
Download the Helm v3 binary. For details, see official Helm documentation.
Roll back the failed release:
helm rollback <failed-release-name> <revision>
For example:
helm rollback openstack-operator 2
Once done, the release will be reconciled.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs because the token misses the necessary role for the new project. As a workaround, log out of the Container Cloud web UI and log in again.