Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.22.0 including the Cluster release 11.6.0.
For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
Bare metal¶
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic
Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
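The two commands above can be wrapped in a small helper so that the drain runs only if the cordon succeeds. This is a minimal sketch, not part of the product: the function name cordon_and_drain and the KUBECTL override variable are assumptions introduced here for illustration.

```shell
# Sketch: cordon and drain a node before deleting the machine that runs the ironic Pod.
# KUBECTL is a hypothetical override hook (defaults to kubectl) used for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

cordon_and_drain() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: cordon_and_drain <nodeName> [extra drain arguments]" >&2
    return 1
  fi
  shift
  # Mark the node unschedulable first, then evict its Pods.
  # Any extra arguments are passed to "kubectl drain" unchanged.
  $KUBECTL cordon "$node" && $KUBECTL drain "$node" "$@"
}
```

To preview the commands without running them, set KUBECTL="echo kubectl" before calling cordon_and_drain <nodeName>.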
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region
label of the affected
region:
cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr
kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>
Warning
Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.
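The cleanup command can be repeated for each object kind listed above. The following is a minimal sketch under stated assumptions: the function name delete_region_objects and the KUBECTL override variable are hypothetical, and the loop covers only the kinds named in this section, whereas other objects that carry the region label may also exist.

```shell
# Sketch: delete objects labeled with the affected region, one kind at a time.
# KUBECTL is a hypothetical override hook (defaults to kubectl) used for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

delete_region_objects() {
  local region="${1:-}"
  local kind
  if [ -z "$region" ]; then
    echo "usage: delete_region_objects <regionName>" >&2
    return 1
  fi
  # Object kinds are taken from the list in this section.
  for kind in cluster machine baremetalhost baremetalhostprofile \
              l2template subnet ipamhost ipaddr; do
    $KUBECTL delete "$kind" -l "kaas.mirantis.com/region=$region" \
      || echo "warning: deletion of $kind objects failed" >&2
  done
}
```

Because the deletion is destructive, consider a dry run first: with KUBECTL="echo kubectl" set, delete_region_objects <regionName> only prints the commands that it would run.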
Equinix Metal with private networking¶
[29296] Deployment of a managed cluster fails during provisioning¶
Deployment of a managed cluster based on Equinix Metal with private networking fails during provisioning with the following error:
InspectionError: Failed to obtain hardware details.
Ensure DHCP relay is up and running
Workaround:
In deployment/dnsmasq, update the image tag version for the dhcpd container to base-alpine-20230118150429:
kubectl -n kaas edit deployment/dnsmasq
In dnsmasq.conf, override the default undionly.kpxe with the ipxe.pxe one:
kubectl -n kaas edit cm dnsmasq-config
Example of existing configuration:
dhcp-boot=/undionly.kpxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local
Example of new configuration:
dhcp-boot=/ipxe.pxe,httpd-http.ipxe.boot.local,dhcp-lb.ipxe.boot.local
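The change to the dhcp-boot line can be expressed as a single substitution. The following is a minimal sketch for previewing the edit on a copy of the configuration text. The function name rewrite_dhcp_boot is hypothetical, and the sketch does not modify the ConfigMap itself.

```shell
# Sketch: rewrite the boot file in dhcp-boot lines read from standard input.
# Only lines that start with "dhcp-boot=/undionly.kpxe," are changed.
rewrite_dhcp_boot() {
  sed 's|^dhcp-boot=/undionly\.kpxe,|dhcp-boot=/ipxe.pxe,|'
}
```

Piping the existing configuration example through rewrite_dhcp_boot yields the new configuration example. All other lines pass through unchanged.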
vSphere¶
[29647] The ‘Network prepared’ stage of cluster deployment never succeeds¶
During deployment of a vSphere-based management or regional cluster with IPAM
disabled, the Network prepared
stage gets stuck in the NotStarted
status. The issue does not affect cluster deployment. Therefore, disregard
the error message.
LCM¶
[5782] Manager machine fails to be deployed during node replacement¶
During replacement of a manager machine, the following problems may occur:
The system adds the node to Docker swarm but not to Kubernetes
The node Deployment gets stuck with failed RethinkDB health checks
Workaround:
Delete the failed node.
Wait for the MKE cluster to become healthy. To monitor the cluster status:
Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.
Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
Deploy a new node.
[5568] The calico-kube-controllers Pod fails to clean up resources¶
During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers
Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig
fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin
kubeconfig
instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig
again.
To obtain the admin kubeconfig:
kubectl --kubeconfig <pathToMgmtKubeconfig> get secret -n <affectedClusterNamespace> \
-o yaml <affectedClusterName>-kubeconfig | awk '/admin.conf/ {print $2}' | \
head -1 | base64 -d > clusterKubeconfig.yaml
If the related cluster is regional, replace <pathToMgmtKubeconfig> with <pathToRegionalKubeconfig>.
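To make the stages of the pipeline above explicit, the parsing part can be isolated in a function that reads the secret YAML from standard input. This is a minimal sketch: the function name extract_admin_kubeconfig is hypothetical, and the sketch assumes that the first line matching admin.conf holds the base64-encoded kubeconfig, as in the command above.

```shell
# Sketch: decode the admin kubeconfig from secret YAML read on standard input.
extract_admin_kubeconfig() {
  # 1. awk prints the second field of every line that matches admin.conf.
  # 2. head keeps only the first match.
  # 3. base64 -d decodes the value into the kubeconfig text.
  awk '/admin.conf/ {print $2}' | head -1 | base64 -d
}
```

For example, the output of the kubectl get secret command above with -o yaml can be piped into extract_admin_kubeconfig and redirected to clusterKubeconfig.yaml.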
TLS configuration¶
[29604] The ‘failed to get kubeconfig’ error during TLS configuration¶
When setting a new Transport Layer Security (TLS) certificate for a cluster, the false positive failed to get kubeconfig error may occur on the Waiting for TLS settings to be applied stage. No action is required, so disregard the error.
To verify the status of the TLS configuration being applied:
kubectl get cluster <ClusterName> -n <ClusterProjectName> -o jsonpath-as-json="{.status.providerStatus.tls.<Application>}"
Possible values for the <Application>
parameter are as follows:
keycloak
ui
cache
mke
iamProxyAlerta
iamProxyAlertManager
iamProxyGrafana
iamProxyKibana
iamProxyPrometheus
Example of system response:
[
{
"expirationTime": "2024-01-06T09:37:04Z",
"hostname": "domain.com"
}
]
In this example, expirationTime equals the NotAfter field of the server certificate, and the value of hostname contains the configured application name.
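To check all applications in one pass, the verification command can be run in a loop over the possible <Application> values. The following is a minimal sketch: the function name tls_status_all and the KUBECTL override variable are assumptions introduced here for illustration.

```shell
# Sketch: print the TLS status of every application listed in this section.
# KUBECTL is a hypothetical override hook (defaults to kubectl) used for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

tls_status_all() {
  local cluster="${1:-}"
  local project="${2:-}"
  local app
  if [ -z "$cluster" ] || [ -z "$project" ]; then
    echo "usage: tls_status_all <ClusterName> <ClusterProjectName>" >&2
    return 1
  fi
  for app in keycloak ui cache mke iamProxyAlerta iamProxyAlertManager \
             iamProxyGrafana iamProxyKibana iamProxyPrometheus; do
    # Print a separator so that each response can be matched to its application.
    echo "== $app"
    $KUBECTL get cluster "$cluster" -n "$project" \
      -o jsonpath-as-json="{.status.providerStatus.tls.$app}"
  done
}
```

Applications that are not configured on the cluster may return an empty result.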
StackLight¶
[30040] OpenSearch is not in the ‘deployed’ status during cluster update¶
Note
The issue may affect the Container Cloud or Cluster release update to the following versions:
2.22.0 for management and regional clusters
11.6.0 for management, regional, and managed clusters
13.2.5, 13.3.5, 13.4.3, and 13.5.2 for attached MKE clusters
The issue does not affect clusters originally deployed since the following Cluster releases: 11.0.0, 8.6.0, 7.6.0.
During cluster update to versions mentioned in the note above, the following OpenSearch-related error may occur on clusters that were originally deployed or attached using Container Cloud 2.15.0 or earlier, before the transition from Elasticsearch to OpenSearch:
The stacklight/opensearch release of the stacklight/stacklight-bundle HelmBundle
reconciled by the stacklight/stacklight-helm-controller Controller
is not in the "deployed" status for the last 15 minutes.
The issue affects clusters with elasticsearch.persistentVolumeClaimSize configured for values other than 30Gi.
To verify that the cluster is affected:
Verify whether the HelmBundleReleaseNotDeployed alert for the opensearch release is firing. If so, the cluster is most probably affected. Otherwise, the cluster is not affected.
Verify the reason for the HelmBundleReleaseNotDeployed alert for the opensearch release:
kubectl get helmbundle stacklight-bundle -n stacklight -o json | jq '.status.releaseStatuses[] | select(.chart == "opensearch") | .message'
Example system response from the affected cluster:
Upgrade "opensearch" failed: cannot patch "opensearch-master" with kind StatefulSet: \
StatefulSet.apps "opensearch-master" is invalid: spec: Forbidden: \
updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
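The check of the release message can be scripted. The following is a minimal sketch: the function name is hypothetical, and matching the substring taken from the example response is an assumption about what identifies an affected cluster.

```shell
# Sketch: succeed if the release message read from standard input reports
# the forbidden patch of the opensearch-master StatefulSet.
is_opensearch_sts_patch_error() {
  # The optional backslashes allow for JSON-escaped quotes, which jq prints
  # when it is run without the -r option.
  grep -Eq 'cannot patch \\?"opensearch-master\\?" with kind StatefulSet'
}
```

For example, pipe the output of the kubectl and jq command above into is_opensearch_sts_patch_error and inspect the exit status: 0 means that the message matches.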
Workaround:
Scale down the opensearch-dashboards and metricbeat resources to 0:
kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards && \
kubectl -n stacklight get pods -l app=opensearch-dashboards | awk '{if (NR!=1) {print $1}}' | xargs -r \
kubectl -n stacklight wait --for=delete --timeout=10m pod

kubectl -n stacklight scale --replicas 0 deployment metricbeat && \
kubectl -n stacklight get pods -l app=metricbeat | awk '{if (NR!=1) {print $1}}' | xargs -r \
kubectl -n stacklight wait --for=delete --timeout=10m pod
Wait for the commands in this and the next step to complete. The completion time depends on the cluster size.
Disable the elasticsearch-curator CronJob:
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": true}}'
Scale down the opensearch-master StatefulSet:
kubectl -n stacklight scale --replicas 0 statefulset opensearch-master && \
kubectl -n stacklight get pods -l app=opensearch-master | awk '{if (NR!=1) {print $1}}' | xargs -r \
kubectl -n stacklight wait --for=delete --timeout=30m pod
Delete the OpenSearch Helm release:
helm uninstall --no-hooks opensearch -n stacklight
Wait up to 5 minutes for Helm Controller to retry the upgrade and properly create the opensearch-master StatefulSet.
To verify readiness of the opensearch-master Pods:
kubectl -n stacklight wait --for=condition=Ready --timeout=30m pod -l app=opensearch-master
Example of a successful system response in an HA setup:
pod/opensearch-master-0 condition met
pod/opensearch-master-1 condition met
pod/opensearch-master-2 condition met
Example of a successful system response in a non-HA setup:
pod/opensearch-master-0 condition met
Scale up the opensearch-dashboards and metricbeat resources:
kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards && \
kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=opensearch-dashboards

kubectl -n stacklight scale --replicas 1 deployment metricbeat && \
kubectl -n stacklight wait --for=condition=Ready --timeout=10m pod -l app=metricbeat
Enable the elasticsearch-curator CronJob:
kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec": {"suspend": false}}'
[29329] Recreation of the Patroni container replica is stuck¶
During an update of a Container Cloud cluster of any type, recreation of the
Patroni container replica is stuck in the degraded state due to the liveness
probe killing the container that runs the pg_rewind
procedure. The issue
affects clusters on which the pg_rewind
procedure takes more time than the
full cycle of the liveness probe.
The sample logs of the affected cluster:
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO: stdout=
INFO: stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0 LOG: database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0 LOG: could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0 LOG: invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0 PANIC: could not locate a valid checkpoint record
Workaround:
For the affected replica and PVC, run:
kubectl delete persistentvolumeclaim/storage-volume-patroni-<replica-id> -n stacklight
kubectl delete pod/patroni-<replica-id> -n stacklight
[28822] Reference Application triggers alerts during its upgrade¶
On managed clusters with enabled Reference Application, the following alerts are triggered during a managed cluster update from the Cluster release 11.5.0 to 11.6.0 or 7.11.0 to 11.5.0:
KubeDeploymentOutage for the refapp Deployment
RefAppDown
RefAppProbeTooLong
RefAppTargetDown
This behavior is expected and requires no action. Therefore, disregard these alerts.
[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM¶
On baremetal-based management clusters, the restart count of the metric-collector Pod increases over time with reason: OOMKilled in the containerStatuses of the metric-collector Pod. Only clusters with HTTP proxy enabled are affected.
Such behavior is expected. Therefore, disregard these restarts.
[28373] Alerta can get stuck after a failed initialization¶
During creation of a Container Cloud cluster of any type with StackLight
enabled, Alerta can get stuck after a failed initialization with only 1 Pod
in the READY
state. For example:
kubectl get po -n stacklight -l app=alerta
NAME READY STATUS RESTARTS AGE
pod/alerta-5f96b775db-45qsz 1/1 Running 0 20h
pod/alerta-5f96b775db-xj4rl 0/1 Running 0 20h
Workaround:
Recreate the affected Alerta Pod:
kubectl --kubeconfig <affectedClusterKubeconfig> -n stacklight delete pod <stuckAlertaPodName>
Verify that both Alerta Pods are in the READY state:
kubectl get po -n stacklight -l app=alerta
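The stuck Pod can be identified from the READY column of the kubectl get po output. The following is a minimal sketch that reads that output from standard input. The function name list_not_ready_pods is hypothetical.

```shell
# Sketch: print the names of Pods whose READY column shows fewer ready
# containers than expected, for example, 0/1.
list_not_ready_pods() {
  # Skip the header line, split READY ("ready/total") on the slash,
  # and print the Pod name when the two numbers differ.
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}
```

For example, pipe the output of kubectl get po -n stacklight -l app=alerta into list_not_ready_pods. For the sample output in this section, the function prints the alerta-5f96b775db-xj4rl Pod only.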
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶
Note
Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed
error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services. All required StackLight pods migrate successfully, except for extra pods that are created and get stuck during pod migration.
As a workaround, remove the stuck pods:
kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>
Ceph¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.
Workaround:
Verify that the description of the Pods that failed to run contains the FailedMount events:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
Output the affected csiPodName logs:
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
Scale down the affected StatefulSet or Deployment of the failing Pod to 0 replicas.
On every csi-rbdplugin Pod, search for stuck csi-vol:
for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done
Unmap the affected csi-vol:
rbd unmap -o force /dev/rbd<i>
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.
Delete volumeattachment of the affected Pod:
kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>
Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.