Known issues

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.25.0 including the Cluster releases 17.0.0, 16.0.0, and 14.1.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.

Bare metal

[42386] A load balancer service does not obtain the external IP address

Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.

The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the service that was changed first remains with no external IP address assigned. However, the second service, which was changed later, has the external IP assigned as expected.

To work around the issue, make a dummy change to the service object where external IP is <pending>:

Identify the service that is stuck:

kubectl get svc -A | grep pending

Example of system response:

stacklight  iam-proxy-prometheus  LoadBalancer  10.233.28.196  <pending>  443:30430/TCP

Add an arbitrary label to the service that is stuck. For example:

kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1

Example of system response:

service/iam-proxy-prometheus labeled

Verify that the external IP was allocated to the service:

kubectl get svc -n stacklight iam-proxy-prometheus

Example of system response:

NAME                  TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)        AGE
iam-proxy-prometheus  LoadBalancer  10.233.28.196  10.0.34.108  443:30430/TCP  12d
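On clusters with many services, the identification step can be scripted. The following is a minimal sketch and not part of the official procedure: the pending_lb_services function name is hypothetical, and the sketch assumes the default column layout of the kubectl get svc -A output (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP).

```shell
# Hypothetical helper: read `kubectl get svc -A` output on stdin and print
# "<namespace> <name>" for every LoadBalancer service without an external IP.
pending_lb_services() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1, $2 }'
}

# Example usage (requires access to the affected cluster):
#   kubectl get svc -A | pending_lb_services
```

Each printed namespace and name pair can then be passed to the kubectl label command shown above.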

[35089] Calico does not set up networking for a pod

Fixed in 17.0.1 and 16.0.1 for MKE 3.7.2

An arbitrary Kubernetes pod may get stuck in an error loop due to a failed Calico networking setup for that pod. The pod cannot access any network resources. The issue occurs more often during cluster upgrade or node replacement, but it can also occur during a new deployment.

You may find the following log for the failed pod IP (for example, 10.233.121.132) in calico-node logs:

felix/route_table.go 898: Syncing routes: found unexpected route; ignoring due to grace period. dest=10.233.121.132/32 ifaceName="cali9731b965838" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=254
felix/route_table.go 898: Syncing routes: found unexpected route; ignoring due to grace period. dest=10.233.121.132/32 ifaceName="cali9731b965838" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=254
...
felix/route_table.go 902: Remove old route dest=10.233.121.132/32 ifaceName="cali9731b965838" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=254
felix/conntrack.go 90: Removing conntrack flows ip=10.233.121.132

The workaround is to manually restart the affected pod:

kubectl delete pod <failedPodID>
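To find the affected pod IP addresses in large calico-node logs, a filter similar to the following may help. This is a minimal sketch: the unexpected_route_ips function name is hypothetical, and the sketch assumes that the log lines follow the format of the example above.

```shell
# Hypothetical helper: read calico-node (Felix) logs on stdin and print each
# pod IP reported in "found unexpected route" messages, one per line.
unexpected_route_ips() {
  sed -n 's|.*found unexpected route.*dest=\([0-9.]*\)/32.*|\1|p' | sort -u
}

# Example usage (namespace and pod name are placeholders):
#   kubectl logs -n <calicoNodeNamespace> <calicoNodePodName> | unexpected_route_ips
#   kubectl get pod -A -o wide | grep <podIP>
```

The second usage command maps a reported IP address to the pod that must be restarted.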

[33936] Deletion failure of a controller node during machine replacement

Fixed in 17.0.1 and 16.0.1 for MKE 3.7.2

Due to the upstream Calico issue, a controller node cannot be deleted if the calico-node Pod is stuck blocking node deletion. One of the symptoms is the following warning in the baremetal-operator logs:

Resolving dependency Service dhcp-lb in namespace kaas failed: \
the server was unable to return a response in the time allotted,\
but may still be processing the request (get endpoints dhcp-lb).

As a workaround, delete the Pod that is stuck to retrigger the node deletion.

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
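The cordon and drain steps can be combined as in the following minimal sketch. The safe_drain_node function name is hypothetical, and the two drain flags are assumptions that are not part of the documented workaround. They are commonly required because kubectl drain otherwise refuses to proceed on nodes that run DaemonSet-managed Pods or Pods with emptyDir volumes.

```shell
# Hypothetical helper: cordon and drain a node before deleting its machine.
# The drain flags are assumptions, not part of the documented workaround:
#   --ignore-daemonsets      skip DaemonSet-managed Pods, which cannot be evicted
#   --delete-emptydir-data   allow eviction of Pods that use emptyDir volumes
safe_drain_node() {
  node="${1:-}"
  [ -n "$node" ] || { echo "usage: safe_drain_node <nodeName>" >&2; return 1; }
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Example usage:
#   safe_drain_node <nodeName>
```

Review the flags before use: --delete-emptydir-data discards the emptyDir contents of the evicted Pods.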

OpenStack

[37634] Cluster deployment or upgrade is blocked by all pods in ‘Pending’ state

Fixed in 17.0.3 and 16.0.3

When using OpenStackCredential with a custom CACert, a management or managed cluster deployment or upgrade is blocked by all pods being stuck in the Pending state. The issue is caused by incorrect secrets being used to initialize the OpenStack external Cloud Provider Interface.

As a workaround, copy CACert from the OpenStackCredential object to openstack-ca-secret:

kubectl --kubeconfig <pathToFailedClusterKubeconfig> patch secret -n kube-system openstack-ca-secret -p '{"data":{"ca.pem":"'$(kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <affectedProjectName> get openstackcredentials <credentialsName> -o go-template="{{.spec.CACert}}")'"}}'

If the CACert from the OpenStackCredential is not base64-encoded:

kubectl --kubeconfig <pathToFailedClusterKubeconfig> patch secret -n kube-system openstack-ca-secret -p '{"data":{"ca.pem":"'$(kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <affectedProjectName> get openstackcredentials <credentialsName> -o go-template="{{.spec.CACert}}" | base64 | tr -d '\n')'"}}'

In either command above, replace the following values:

<pathToFailedClusterKubeconfig> is the file path to the affected managed or management cluster kubeconfig.
<pathToManagementClusterKubeconfig> is the file path to the Container Cloud management cluster kubeconfig.
<affectedProjectName> is the Container Cloud project name containing the cluster with stuck pods. For a management cluster, the value is default.
<credentialsName> is the OpenStackCredential name used for the deployment.
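To decide which of the two commands applies, you can inspect the stored CACert value. The following is a minimal sketch: the cacert_format function name is hypothetical, and the check relies only on the fact that a raw PEM block starts with -----BEGIN. Any other non-empty value is assumed to be already base64-encoded.

```shell
# Hypothetical helper: print "pem" if the value looks like a raw PEM block and
# therefore still needs base64 encoding; print "base64" for any other
# non-empty value; fail on an empty value.
cacert_format() {
  case "${1:-}" in
    "")            echo "empty" >&2; return 1 ;;
    "-----BEGIN"*) echo "pem" ;;
    *)             echo "base64" ;;
  esac
}

# Example usage:
#   cacert=$(kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
#     -n <affectedProjectName> get openstackcredentials <credentialsName> \
#     -o go-template="{{.spec.CACert}}")
#   cacert_format "$cacert"
```

If the helper prints pem, use the command variant that pipes the value through base64.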

IAM

[37766] Sign-in to the MKE web UI fails with ‘invalid parameter: redirect_uri’

Fixed in 17.0.3 and 16.0.3

A sign-in to the MKE web UI of the management cluster using the Sign in with External Provider option can fail with the invalid parameter: redirect_uri error.

Workaround:

Log in to the Keycloak admin console.
In the sidebar menu, switch to the IAM realm.
Navigate to Clients > kaas.
On the page, navigate to Settings > Access settings > Valid redirect URIs.
Add https://<mgmt mke ip>:6443/* to the list of valid redirect URIs and click Save.
Refresh the browser window with the sign-in URI.

LCM

[31186,34132] Pods get stuck during MariaDB operations

During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
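The backup, cleanup, and restart steps of the workaround above can be sketched as follows. This is an illustration under stated assumptions, not the official procedure: the reset_mariadb_gcache function name is hypothetical, the galera.cache file is assumed to reside in /var/lib/mysql, and kubectl cp requires tar inside the container. Verify manually that the other replicas are up and ready before running it, and add -c <containerName> to the commands if the Pod runs several containers.

```shell
# Hypothetical helper covering the backup, cleanup, and restart steps.
# Assumption: galera.cache resides in /var/lib/mysql of the affected Pod.
reset_mariadb_gcache() {
  ns="${1:-}"; pod="${2:-}"; backup_dir="${3:-}"
  [ -n "$ns" ] && [ -n "$pod" ] && [ -n "$backup_dir" ] || {
    echo "usage: reset_mariadb_gcache <namespace> <podName> <backupDir>" >&2
    return 1
  }
  # Back up the data directory of the affected Pod.
  kubectl -n "$ns" cp "$pod:/var/lib/mysql" "$backup_dir" || return 1
  # Remove the Galera cache file of the affected Pod.
  kubectl -n "$ns" exec "$pod" -- rm -f /var/lib/mysql/galera.cache || return 1
  # Delete the Pod so that Kubernetes recreates it.
  kubectl -n "$ns" delete pod "$pod"
}

# Example usage:
#   reset_mariadb_gcache <namespace> <mariadbServerPodName> ./mysql-backup
```

The helper stops at the first failed command, so the cache file is never removed unless the backup succeeded.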

[32761] Node cleanup fails due to remaining devices

Fixed in 17.1.0 and 16.1.0

On MOSK clusters, the Ansible provisioner may hang in a loop while trying to remove LVM thin pool logical volumes (LVs) due to issues with volume detection before removal. The Ansible provisioner cannot remove LVM thin pool LVs correctly, so it consistently detects the same volumes whenever it scans disks, leading to a repetitive cleanup process.

The following symptoms mean that a cluster can be affected:

A node was configured to use thin pool LVs. For example, it had the OpenStack Cinder role in the past.
A bare metal node deployment flaps between provisioning and deprovisioning states.
In the Ansible provisioner logs, the number of the following example warnings keeps growing:
```
88621.log:7389:2023-06-22 16:30:45.109 88621 ERROR ansible.plugins.callback.ironic_log
[-] Ansible task clean : fail failed on node 14eb0dbc-c73a-4298-8912-4bb12340ff49:
{'msg': 'There are more devices to clean', '_ansible_no_log': None, 'changed': False}
```
Important

There are more devices to clean is a regular warning indicating some in-progress tasks. However, if the number of such warnings keeps growing while the node flaps between the provisioning and deprovisioning states, the cluster is most likely affected by the issue.

As a workaround, erase disks manually using any preferred tool.
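To check whether the number of warnings is growing, you can count them at two points in time and compare the results. The following is a minimal sketch, and the count_cleanup_warnings function name is hypothetical.

```shell
# Hypothetical helper: read Ansible provisioner logs on stdin and print the
# number of lines that contain the "There are more devices to clean" message.
count_cleanup_warnings() {
  # grep -c exits with status 1 when the count is 0; treat that as success.
  grep -c 'There are more devices to clean' || true
}

# Example usage (the log file name is a placeholder):
#   count_cleanup_warnings < <provisionerLogFile>
```

A count that keeps increasing between runs, combined with the flapping node state, matches the symptoms listed above.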

[30294] Replacement of a master node is stuck on the calico-node Pod start

Fixed in 2.28.4 (17.3.4 and 16.3.4)

During replacement of a master node on a cluster of any type, the calico-node Pod fails to start on a new node that has the same IP address as the node being replaced.

Workaround:

From a CLI with an MKE client bundle, create a shell alias to start calicoctl using the mirantis/ucp-dsinfo image:

Since MKE 3.7.2

alias calicoctl="\
docker run -i --rm \
--pid host \
--net host \
-e constraint:ostype==linux \
-e ETCD_ENDPOINTS=<etcdEndpoint> \
-e ETCD_KEY_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/key.pem \
-e ETCD_CA_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/ca.pem \
-e ETCD_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/cert.pem \
-v /var/run/calico:/var/run/calico \
-v /var/lib/docker/volumes/ucp-kv-certs/_data:/var/lib/docker/volumes/ucp-kv-certs/_data:ro \
mirantis/ucp-dsinfo:<mkeVersion> \
calicoctl \
"

Before MKE 3.7.2

alias calicoctl="\
docker run -i --rm \
--pid host \
--net host \
-e constraint:ostype==linux \
-e ETCD_ENDPOINTS=<etcdEndpoint> \
-e ETCD_KEY_FILE=/ucp-node-certs/key.pem \
-e ETCD_CA_CERT_FILE=/ucp-node-certs/ca.pem \
-e ETCD_CERT_FILE=/ucp-node-certs/cert.pem \
-v /var/run/calico:/var/run/calico \
-v ucp-node-certs:/ucp-node-certs:ro \
mirantis/ucp-dsinfo:<mkeVersion> \
calicoctl --allow-version-mismatch \
"

In the commands above, replace the following values with the corresponding settings of the affected cluster:

<etcdEndpoint> is the etcd endpoint defined in the Calico configuration file. For example, ETCD_ENDPOINTS=127.0.0.1:12378.
<mkeVersion> is the MKE version installed on your cluster. For example, mirantis/ucp-dsinfo:3.5.7.

Verify the node list on the cluster:
```
kubectl get node
```
Compare this list with the node list in Calico to identify the old node:
```
calicoctl get node -o wide
```

Remove the old node from Calico:

calicoctl delete node kaas-node-<nodeID>
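The comparison of the two node lists can be automated as in the following minimal sketch. The stale_calico_nodes function name and the file names are hypothetical. The sketch assumes that the first column of the calicoctl get node -o wide output contains the node name and that the first line is a header.

```shell
# Hypothetical helper: print the names listed in <calicoNodesFile> that are
# missing from <kubeNodesFile>. Both files contain one node name per line.
stale_calico_nodes() {
  kube_nodes_file="$1"; calico_nodes_file="$2"
  # -F fixed strings, -x whole-line match, -v invert: keep Calico-only names.
  grep -vxFf "$kube_nodes_file" "$calico_nodes_file" || true
}

# Example usage:
#   kubectl get node -o name | sed 's|^node/||' > kube-nodes.txt
#   calicoctl get node -o wide | awk 'NR > 1 { print $1 }' > calico-nodes.txt
#   stale_calico_nodes kube-nodes.txt calico-nodes.txt
```

Review the printed names before passing any of them to calicoctl delete node.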

[5782] Manager machine fails to be deployed during node replacement

Fixed in 2.28.4 (17.3.4 and 16.3.4)

During replacement of a manager machine, the following problems may occur:

The system adds the node to Docker swarm but not to Kubernetes
The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

Delete the failed node.
Wait for the MKE cluster to become healthy. To monitor the cluster status:
1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.
2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
Deploy a new node.

[5568] The calico-kube-controllers Pod fails to clean up resources

Fixed in 2.28.4 (17.3.4 and 16.3.4)

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>

Ceph

[34820] The Ceph ‘rook-operator’ fails to connect to RGW on FIPS nodes

Fixed in 17.1.0 and 16.1.0

Due to the upstream Ceph issue, on clusters with the Federal Information Processing Standard (FIPS) mode enabled, the Ceph rook-operator fails to connect to Ceph RADOS Gateway (RGW) pods.

As a workaround, do not place Ceph RGW pods on nodes where FIPS mode is enabled.
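To identify the nodes to avoid, you can check the kernel FIPS flag on each node. The following is a minimal sketch that must run on the node itself, for example, over SSH. The fips_mode_enabled function name is hypothetical. The sketch relies on the Linux kernel exposing the flag in /proc/sys/crypto/fips_enabled, where 1 means that FIPS mode is enabled.

```shell
# Hypothetical helper: succeed if FIPS mode is enabled on the current host.
# The optional argument overrides the kernel flag path, mainly for testing.
fips_mode_enabled() {
  flag_file="${1:-/proc/sys/crypto/fips_enabled}"
  [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Example usage:
#   fips_mode_enabled && echo "FIPS mode is enabled on $(hostname)"
```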

[26441] Cluster update fails with the MountDevice failed for volume warning

Update of a managed bare metal cluster with Ceph enabled fails with a PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

Verify that the description of the Pods that failed to run contains the FailedMount events:
```
kubectl -n <affectedProjectName> describe pod <affectedPodName>
```
In the command above, replace the following values:
- <affectedProjectName> is the Container Cloud project name where the Pods failed to run
- <affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
1. Identify csiPodName of the corresponding csi-rbdplugin:
```
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
```
2. Output the affected csiPodName logs:
```
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
```
Scale down the affected StatefulSet or Deployment of the failing Pod to 0 replicas.

On every csi-rbdplugin Pod, search for stuck csi-vol:

for pod in $(kubectl -n rook-ceph get pods | grep rbdplugin | grep -v provisioner | awk '{print $1}'); do
  echo "$pod"
  kubectl exec -it -n rook-ceph "$pod" -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done

Unmap the affected csi-vol:
```
rbd unmap -o force /dev/rbd<i>
```
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete volumeattachment of the affected Pod:

kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
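To obtain the /dev/rbd<i> value for the unmap step, you can extract it from the rbd device list output. The following is a minimal sketch: the rbd_device_for_volume function name is hypothetical, and the sketch assumes that the device path is the last column of the output.

```shell
# Hypothetical helper: read `rbd device list` output on stdin and print the
# device path (last column) of every mapping whose line mentions the volume.
rbd_device_for_volume() {
  awk -v vol="${1:-}" 'vol != "" && index($0, vol) { print $NF }'
}

# Example usage (run for the csi-rbdplugin Pod that holds the stuck volume):
#   kubectl exec -n rook-ceph <csiPodName> -c csi-rbdplugin -- rbd device list \
#     | rbd_device_for_volume <csi-vol-uuid>
```

Matching on the whole line and printing the last field keeps the helper independent of the optional namespace column, which may be empty.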

Update

[37268] Container Cloud upgrade is blocked by a node in ‘Prepare’ or ‘Deploy’ state

Fixed in 17.1.0 and 16.1.0

Container Cloud upgrade may be blocked by a node being stuck in the Prepare or Deploy state with the error processing package openssh-server message. The issue is caused by customizations in /etc/ssh/sshd_config, such as additional Match statements. This file is managed by Container Cloud and must not be altered manually.

As a workaround, move customizations from sshd_config to a new file in the /etc/ssh/sshd_config.d/ directory.
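To locate the Match statements mentioned above before moving them, you can list them with their line numbers. The following is a minimal sketch, and the list_sshd_match_lines function name is hypothetical. Other types of customizations still require a manual review of the file.

```shell
# Hypothetical helper: print the numbered lines of an sshd configuration file
# that start a Match block. Keywords in sshd_config are case-insensitive.
list_sshd_match_lines() {
  grep -inE '^[[:space:]]*match[[:space:]]' "${1:-/etc/ssh/sshd_config}" || true
}

# Example usage:
#   list_sshd_match_lines /etc/ssh/sshd_config
```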

[36928] The helm-controller Deployment is stuck during cluster update

During a cluster update, a Kubernetes helm-controller Deployment may get stuck in a restarting Pod loop with Terminating and Running states flapping. Other Deployment types may also be affected.

As a workaround, restart the Deployment that got stuck:

kubectl -n <affectedProjectName> get deploy <affectedDeployName> -o yaml

kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas 0

kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas <replicasNumber>

In the commands above, replace the following values:

<affectedProjectName> is the Container Cloud project name containing the cluster with stuck Pods
<affectedDeployName> is the Deployment name that failed to run Pods in the specified project
<replicasNumber> is the original number of replicas for the Deployment that you can obtain using the get deploy command
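The three commands can be combined so that the original number of replicas is captured automatically. The following is a minimal sketch: the restart_deployment function name is hypothetical, and the sketch reads the replica count from the .spec.replicas field instead of the full YAML output.

```shell
# Hypothetical helper: scale a Deployment to 0 and back to its original size.
restart_deployment() {
  ns="${1:-}"; deploy="${2:-}"
  [ -n "$ns" ] && [ -n "$deploy" ] || {
    echo "usage: restart_deployment <affectedProjectName> <affectedDeployName>" >&2
    return 1
  }
  # Record the original number of replicas before scaling down.
  replicas=$(kubectl -n "$ns" get deploy "$deploy" -o jsonpath='{.spec.replicas}') || return 1
  [ -n "$replicas" ] || { echo "cannot read replicas of $deploy" >&2; return 1; }
  kubectl -n "$ns" scale deploy "$deploy" --replicas 0 &&
    kubectl -n "$ns" scale deploy "$deploy" --replicas "$replicas"
}

# Example usage:
#   restart_deployment <affectedProjectName> <affectedDeployName>
```

The helper scales the Deployment back up only if the scale-down command succeeded.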

[33438] ‘CalicoDataplaneFailuresHigh’ alert is firing during cluster update

During an update of a managed bare metal cluster, a false-positive CalicoDataplaneFailuresHigh alert may fire. Disregard this alert. It disappears once the cluster update succeeds.

The observed behavior is typical for calico-node during upgrades, as workload changes occur frequently. Consequently, there is a possibility of temporary desynchronization in the Calico dataplane. This can occasionally result in throttling when applying workload changes to the Calico dataplane.