Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.13.0 including the Cluster releases 7.3.0, 6.19.0, and 5.20.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
Bare metal¶
[18752] Bare metal hosts in ‘provisioned registration error’ state after update¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
After update of a management or managed cluster created using the Container Cloud release earlier than 2.6.0, a bare metal host state is Provisioned in the Container Cloud web UI while having the error state in logs with the following message:
status:
  errorCount: 1
  errorMessage: 'Host adoption failed: Error while attempting to adopt node 7a8d8aa7-e39d-48ec-98c1-ed05eacc354f:
    Validation of image href http://10.10.10.10/images/stub_image.qcow2 failed,
    reason: Got HTTP code 404 instead of 200 in response to HEAD request..'
  errorType: provisioned registration error
The issue is caused by the image URL pointing to an unavailable resource due to the URI IP change during the update. As a workaround, update the URLs in the bare metal host status and spec with correct values that use a stable DNS record as the host.
Workaround:
Note
In the commands below, we update master-2 as an example. Replace it with the corresponding value to fit your deployment.
Exit Lens.
In a new terminal, configure access to the affected cluster.
Start kube-proxy:
kubectl proxy &
Pause the reconcile:
kubectl patch bmh master-2 --type=merge --patch '{"metadata":{"annotations":{"baremetalhost.metal3.io/paused": "true"}}}'
Create the payload data with the following content:
For status_payload.json:
{
  "status": {
    "errorCount": 0,
    "errorMessage": "",
    "provisioning": {
      "image": {
        "checksum": "http://httpd-http/images/stub_image.qcow2.md5sum",
        "url": "http://httpd-http/images/stub_image.qcow2"
      },
      "state": "provisioned"
    }
  }
}
For spec_payload.json:
{
  "spec": {
    "image": {
      "checksum": "http://httpd-http/images/stub_image.qcow2.md5sum",
      "url": "http://httpd-http/images/stub_image.qcow2"
    }
  }
}
Verify that the payload data is valid:
cat status_payload.json | jq
cat spec_payload.json | jq
The system response must contain the data added in the previous step.
Patch the bare metal host status with payload:
curl -k -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" --data-binary "@status_payload.json" 127.0.0.1:8001/apis/metal3.io/v1alpha1/namespaces/default/baremetalhosts/master-2/status
Patch the bare metal host spec with payload:
kubectl patch bmh master-2 --type=merge --patch "$(cat spec_payload.json)"
Resume the reconcile:
kubectl patch bmh master-2 --type=merge --patch '{"metadata":{"annotations":{"baremetalhost.metal3.io/paused":null}}}'
Close the terminal to quit kube-proxy and resume Lens.
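Optionally, verify that the host no longer reports the error. A minimal check, assuming kubectl access to the management cluster is still configured:
# The fields set by status_payload.json should now be clean: errorCount 0 and an empty errorMessage
kubectl get bmh master-2 -o jsonpath='{.status.errorCount}{" "}{.status.errorMessage}{"\n"}'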
[17792] Full preflight fails with a timeout waiting for BareMetalHost¶
If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
preflight check failed: preflight full check failed: \
error waiting for BareMetalHosts to power on: \
timed out waiting for the condition
Workaround:
Unset full preflight using the unset KAAS_BM_FULL_PREFLIGHT command.
Rerun bootstrap.sh preflight, which executes fast preflight instead.
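For example, a minimal sequence, assuming the same shell session and bootstrap directory as the failed attempt:
# Disable full preflight and rerun the fast preflight check
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight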
OpenStack¶
[10424] Regional cluster cleanup fails by timeout¶
An OpenStack-based regional cluster cleanup fails with the timeout error.
Workaround:
Wait for the Cluster object to be deleted in the bootstrap cluster:
kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
The system output must be empty.
Remove the bootstrap cluster manually:
./bin/kind delete cluster --name clusterapi
vSphere¶
[19468] ‘Failed to remove finalizer from machine’ error during cluster deletion¶
If a RHEL license is removed before the related managed cluster is deleted, the cluster deletion hangs with the following Machine object error:
Failed to remove finalizer from machine ...
failed to get RHELLicense object
As a workaround, recreate the removed RHEL license object with the same name using the Container Cloud web UI or API. For an API-based example, see the sketch after the warning below.
Warning
The kubectl apply command automatically saves the applied data as plain text into the kubectl.kubernetes.io/last-applied-configuration annotation of the corresponding object. This may result in revealing sensitive data in this annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object. Use kubectl create, kubectl patch, or kubectl edit instead.
If you used kubectl apply on this object, you can remove the kubectl.kubernetes.io/last-applied-configuration annotation from the object using kubectl edit.
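If you recreate the object through the API, the sketch below shows the general shape only. The apiVersion and the spec contents are assumptions for illustration; verify them against the Container Cloud API reference or an existing RHELLicense object before use.
# Hypothetical manifest: apiVersion and spec contents are assumptions, adjust to your environment
cat > rhel-license.yaml <<'EOF'
apiVersion: kaas.mirantis.com/v1alpha1   # assumed API group and version
kind: RHELLicense
metadata:
  name: <removedLicenseName>             # must match the name of the removed license object
  namespace: <clusterProjectName>
spec: {}                                 # fill in the original license credentials
EOF
# Use kubectl create, not kubectl apply, as explained in the warning above
kubectl create -f rhel-license.yaml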
[14080] Node leaves the cluster after IP address change¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM is not enabled and IP addresses are assigned to the vSphere virtual machines by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such a lease by sending frequent DHCP requests to the server before the lease period ends. The DHCP prolongation request period is always less than the default lease time on the DHCP server, so prolongation usually works. But in case of network issues, for example, when dhclient from the VM cannot reach the DHCP server, or the VM is being slowly powered on for more than the lease time, such a VM may lose its assigned IP address. As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after the IP of the VM has been changed. Therefore, such issue may lead to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the following example error:
docker logs -f ucp-swarm-manager
level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>: 12376"
If the affected node is a manager:
The output of the docker info command contains the following example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.
The following lines are present in /var/log/messages:
dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
If there are several lines where the IP is different, the node is affected.
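A quick way to run this check, assuming dhclient logs to /var/log/messages as on the default VM template:
# List the addresses dhclient has bound on this node; more than one distinct IP means the address changed
grep dhclient /var/log/messages | grep 'bound to'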
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address (see the example after this list).
If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.
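For the static IP option, the following is a minimal sketch of /etc/sysconfig/network-scripts/ifcfg-eth0 for a RHEL-based VM; all values are placeholders and must match your vSphere network and the IP address the node already uses.
# Illustrative static configuration for the first interface; reuse the node's current IP
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=<node IP>
PREFIX=24
GATEWAY=<gateway IP>
DNS1=<DNS server IP>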
LCM¶
[18708] ‘Pending’ state of machines during a cluster deployment or attachment¶
During deployment of any Container Cloud cluster or attachment of an existing MKE cluster that is not deployed by Container Cloud, the machines are stuck in the Pending state with no lcmcluster-controller entries in the lcm-controller logs except the following ones:
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> logs lcm-lcm-controller-<controllerID> -n kaas | grep lcmcluster-controller
{"level":"info","ts":1634808016.777575,"logger":"controller-runtime.manager.controller.lcmcluster-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
{"level":"info","ts":1634808016.8779392,"logger":"controller-runtime.manager.controller.lcmcluster-controller","msg":"Starting EventSource","source":"kind source: /, Kind="}
The issue affects only clusters with the Container Cloud projects (Kubernetes namespaces) in the Terminating state.
Workaround:
Verify the state of the Container Cloud projects:
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> get ns
If any project is in the Terminating state, proceed to the next step. Otherwise, further assess the cluster logs to identify the root cause of the issue.
Clean up the project that is stuck in the Terminating state:
Identify the objects that are stuck in the project:
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> get ns <projectName> -o yaml
Example of system response:
...
status:
  conditions:
  ...
  - lastTransitionTime: "2021-10-19T17:05:23Z"
    message: 'Some resources are remaining: pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
Remove the metadata.finalizers field from the affected objects (alternatively, see the patch sketch after this procedure):
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> edit <objectType>/<objectName> -n <objectProjectName>
Restart lcm-controller on the affected management or regional cluster:
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> get pod -n kaas | awk '/lcm-controller/ {print $1}' | xargs kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> delete pod -n kaas
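As an alternative to interactive editing in the finalizers step above, you can clear the finalizers non-interactively with a merge patch; a sketch using the same placeholders:
# Clears metadata.finalizers on the stuck object so that the project can finish terminating
kubectl --kubeconfig <pathToMgmtOrRegionalClusterKubeconfig> patch <objectType> <objectName> \
  -n <objectProjectName> --type=merge -p '{"metadata":{"finalizers":null}}'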
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machines statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions. The workaround below describes the recovery steps for the stacklight release that got stuck during a cluster deployment. Use this procedure as an example for other Helm releases as required.
Workaround:
Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:
kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight:
  attempt: 2
  chart: ""
  finishedAt: "2021-02-05T09:41:05Z"
  hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
  message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]'
  notes: ""
  status: UNKNOWN
  success: false
  version: 0.1.2-mcp-398
Log in to the helm-controller pod console:
kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
Download the Helm v3 binary. For details, see official Helm documentation.
Remove the failed release:
helm delete <failed-release-name>
For example:
helm delete stacklight
Once done, the release is triggered for redeployment.
IAM¶
StackLight¶
[19682] URLs in Salesforce alerts use HTTP for IAM with enabled TLS¶
Prometheus web UI URLs in StackLight notifications sent to Salesforce use a wrong protocol: HTTP instead of HTTPS. The issue affects deployments with TLS enabled for IAM.
The workaround is to manually change the URL protocol in the web browser.
Storage¶
[20312] Creation of Ceph-based PVs gets stuck in Pending state¶
The csi-rbdplugin-provisioner pod (csi-provisioner container) may show constant retries attempting to create a PV if the csi-rbdplugin-provisioner pod was scheduled and started on a node with no connectivity to the Ceph storage. As a result, creation of a Ceph-based persistent volume (PV) may get stuck in the Pending state.
As a workaround, manually specify the affinity or toleration rules for the csi-rbdplugin-provisioner pod.
Workaround:
On the managed cluster, open the rook-ceph-operator-config ConfigMap for editing:
kubectl edit configmap -n rook-ceph rook-ceph-operator-config
To avoid spawning pods on the nodes where this is not needed, set the provisioner node affinity specifying the required node labels. For example:
CSI_PROVISIONER_NODE_AFFINITY: "role=storage-node; storage=rook, ceph"
Note
If needed, you can also specify tolerations using CSI_PROVISIONER_TOLERATIONS. For example:
CSI_PROVISIONER_TOLERATIONS: |
- effect: NoSchedule
key: node-role.kubernetes.io/controlplane
operator: Exists
- effect: NoExecute
key: node-role.kubernetes.io/etcd
operator: Exists
[18879] The RGW pod overrides the global CA bundle with an incorrect mount¶
During deployment of a Ceph cluster, the RADOS Gateway (RGW) pod overrides the global CA bundle located at /etc/pki/tls/certs with an incorrect self-signed CA bundle. The issue affects only clusters with public certificates.
Workaround:
Open the KaasCephCluster CR of a managed cluster for editing:
kubectl edit kaascephcluster -n <managedClusterProjectName>
Substitute <managedClusterProjectName> with a corresponding value.
Select from the following options:
If you are using the GoDaddy certificates, in the cephClusterSpec.objectStorage.rgw section, replace the cacert parameters with your public CA certificate that already contains both the root CA certificate and intermediate CA certificate:
cephClusterSpec:
  objectStorage:
    rgw:
      SSLCert:
        cacert: |
          -----BEGIN CERTIFICATE-----
          ca-certificate here
          -----END CERTIFICATE-----
        tlsCert: |
          -----BEGIN CERTIFICATE-----
          private TLS certificate here
          -----END CERTIFICATE-----
        tlsKey: |
          -----BEGIN RSA PRIVATE KEY-----
          private TLS key here
          -----END RSA PRIVATE KEY-----
If you are using the DigiCert certificates:
Download the <root_CA> from DigiCert.
In the cephClusterSpec.objectStorage.rgw section, replace the cacert parameters with your public intermediate CA certificate along with the root one:
cephClusterSpec:
  objectStorage:
    rgw:
      SSLCert:
        cacert: |
          -----BEGIN CERTIFICATE-----
          <root CA here>
          <intermediate CA here>
          -----END CERTIFICATE-----
        tlsCert: |
          -----BEGIN CERTIFICATE-----
          private TLS certificate here
          -----END CERTIFICATE-----
        tlsKey: |
          -----BEGIN RSA PRIVATE KEY-----
          private TLS key here
          -----END RSA PRIVATE KEY-----
[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13¶
Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1
Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and may cause data corruption. Therefore, manageOsds is disabled until further notice.
As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster, perform the steps described in Remove Ceph OSD manually.
Upgrade¶
[4288] Equinix and MOS managed clusters update failure¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
The Equinix Metal and MOS-based managed clusters may fail to update to the latest Cluster release with kubelet being stuck and reporting authorization errors.
The cluster is affected by the issue if you see the Failed to make webhook authorizer request: context canceled error in the kubelet logs:
docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'
As a workaround, restart the ucp-kubelet container on the affected node(s):
ctr -n com.docker.ucp snapshot rm ucp-kubelet
docker rm -f ucp-kubelet
Note
Ignore failures in the output of the first command, if any.
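To confirm that the restart helped, you can repeat the detection command from above after a few minutes; no output means the errors have stopped:
docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'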
[16379,23865] Cluster update fails with the FailedMount warning¶
An Equinix-based management or managed cluster fails to update with the FailedAttachVolume and FailedMount warnings.
Workaround:
Verify that the description of the pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
<affectedProjectName> is the Container Cloud project name where the pods failed to run.
<affectedPodName> is a pod name that failed to run in this project.
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
  -o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the pod that fails to init to 0 replicas.
On every csi-rbdplugin pod, search for stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete volumeattachment of the affected pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state is Running.
[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.
Workaround:
Log in to the helm-controller pod console:
kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
Identify the release that got stuck in the PENDING_UPGRADE status. For example:
./helm --host=localhost:44134 history openstack-operator
Example of system response:
REVISION  UPDATED                    STATUS           CHART                      DESCRIPTION
1         Tue Dec 15 12:30:41 2020   SUPERSEDED       openstack-operator-0.3.9   Install complete
2         Tue Dec 15 12:32:05 2020   SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
3         Tue Dec 15 16:24:47 2020   PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
Roll back the failed release to the previous revision:
Download the Helm v3 binary. For details, see official Helm documentation.
Roll back the failed release:
helm rollback <failed-release-name> <revision>
For example:
helm rollback openstack-operator 2
Once done, the release will be reconciled.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, relogin to the Container Cloud web UI.