Known issues¶

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.26.4, including the Cluster releases 17.1.4 and 16.1.4.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines known issues from previous Container Cloud releases that are still valid.

Bare metal¶

[42408] Kernel is not updated on manager nodes after cluster update¶

Fixed in 17.1.5 and 16.1.5

After a managed cluster update, old versions of system packages, including the kernel, may remain on the manager nodes. This issue occurs because the task responsible for updating packages fails to run after updating Ubuntu mirrors.

As a workaround, manually run apt-get upgrade on every manager node after the cluster update but before rebooting the node.

[41305] DHCP responses are lost between dnsmasq and dhcp-relay pods¶

After node maintenance of a management cluster, the newly added nodes may fail to undergo provisioning successfully. The issue relates to new nodes that are in the same L2 domain as the management cluster.

The issue was observed on environments having management cluster nodes configured with a single L2 segment used for all network traffic (PXE and LCM/management networks).

To verify whether the cluster is affected:

Verify whether the dnsmasq and dhcp-relay pods run on the same node in the management cluster:

kubectl -n kaas get pods -o wide| grep -e "dhcp\|dnsmasq"

Example of system response:

dhcp-relay-7d85f75f76-5vdw2   2/2   Running   2 (36h ago)   36h   10.10.0.122     kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>
dnsmasq-8f4b484b4-slhbd       5/5   Running   1 (36h ago)   36h   10.233.123.75   kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>

If this is the case, proceed to the workaround below.
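
The comparison above can also be scripted. The following is a minimal sketch rather than part of the product tooling: the pods_share_node function name is hypothetical, and the sketch assumes the `-o wide` column layout shown in the example response, where the node name is the third field from the end of each line.

```shell
# Hypothetical helper: reads `kubectl -n kaas get pods -o wide` output on stdin
# and reports whether the dhcp-relay and dnsmasq pods run on the same node.
# Assumes the NODE column is the third field from the end of each line.
pods_share_node() {
  awk '
    $1 ~ /^dhcp-relay-/ { relay = $(NF - 2) }
    $1 ~ /^dnsmasq-/    { dnsmasq = $(NF - 2) }
    END {
      if (relay == "" || dnsmasq == "") { print "unknown: pod not found" }
      else if (relay == dnsmasq)         { print "same node: " relay }
      else                               { print "different nodes" }
    }'
}

# Usage (requires access to the management cluster):
#   kubectl -n kaas get pods -o wide | pods_share_node
```

A "same node" result corresponds to the affected state described above.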

Workaround:

Make sure that at least two management cluster nodes are schedulable:

kubectl get node

Example of a positive system response:

NAME                                             STATUS   ROLES    AGE   VERSION
kaas-node-bcedb87b-b3ce-46a4-a4ca-ea3068689e40   Ready    master   37h   v1.27.10-mirantis-1
kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   Ready    master   37h   v1.27.10-mirantis-1
kaas-node-ad5a6f51-b98f-43c3-91d5-55fed3d0ff21   Ready    master   37h   v1.27.10-mirantis-1

Delete the dhcp-relay pod:

kubectl -n kaas delete pod <dhcp-relay-xxxxx>

Verify that the dnsmasq and dhcp-relay pods are scheduled to different nodes:

kubectl -n kaas get pods -o wide| grep -e "dhcp\|dnsmasq"

Example of a positive system response:

dhcp-relay-7d85f75f76-rkv03   2/2   Running   0             49s   10.10.0.121     kaas-node-bcedb87b-b3ce-46a4-a4ca-ea3068689e40   <none>   <none>
dnsmasq-8f4b484b4-slhbd       5/5   Running   1 (37h ago)   37h   10.233.123.75   kaas-node-8a24b81c-76d0-4d4c-8421-962bd39df5ad   <none>   <none>
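
The schedulable-node check from the first workaround step can be sketched as a small filter. The count_schedulable_nodes name is hypothetical, and the sketch only inspects the STATUS column of `kubectl get node`; it does not evaluate taints, so treat the result as a first approximation.

```shell
# Hypothetical helper: reads `kubectl get node` output on stdin and prints the
# number of nodes whose STATUS is exactly "Ready". Cordoned nodes report
# "Ready,SchedulingDisabled" and are therefore not counted.
count_schedulable_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage (requires access to the management cluster):
#   kubectl get node | count_schedulable_nodes
```

A result of 2 or more matches the precondition for deleting the dhcp-relay pod.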

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.
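
The two commands can be wrapped in one function so that the drain never runs if the cordon fails. This is a sketch only: the cordon_and_drain name and the KUBECTL override variable are hypothetical conveniences, the latter included so that the function can be dry-run with a stub. Any extra arguments are passed to `kubectl drain` unchanged; for example, `--ignore-daemonsets` is commonly needed on nodes that run DaemonSet-managed Pods.

```shell
# Hypothetical helper: cordons a node, then drains it.
# KUBECTL can point to an alternative binary or a stub for dry runs.
cordon_and_drain() {
  node="$1"
  shift
  kubectl_bin="${KUBECTL:-kubectl}"
  "$kubectl_bin" cordon "$node" || return 1
  "$kubectl_bin" drain "$node" "$@"
}

# Usage:
#   cordon_and_drain <nodeName>
#   cordon_and_drain <nodeName> --ignore-daemonsets
```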

LCM¶

[41540] LCM Agent cannot grab storage information on a host¶

Fixed in 17.1.5 and 16.1.5

Due to issues with managing physical NVMe devices, lcm-agent cannot grab storage information on a host. As a result, lcmmachine.status.hostinfo.hardware is empty and the following example error is present in logs:

{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent", \
"msg":"get hardware details", \
"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b", \
"error":"new storage info: get disk info \"nvme0c0n1\": \
invoke command: exit status 1","errorVerbose":"exit status 1

As a workaround, on the affected node, create a symlink for any device indicated in lcm-agent logs. For example:

ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
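
When several devices are reported, the symlink command can be derived from each device name. The sketch below generalizes from the single example above and assumes that the reported name always has the form nvme<X>c<Y>n<Z> and that the matching block device is nvme<X>n<Z>; verify this against the actual devices on the node before running the printed command. The nvme_symlink_cmd name is hypothetical.

```shell
# Hypothetical helper: prints (does not run) the ln command for a device name
# taken from the lcm-agent error, for example nvme0c0n1.
# Assumption: the target device is the same name without the c<N> component.
nvme_symlink_cmd() {
  dev="$1"
  target=$(printf '%s\n' "$dev" | sed -E 's/^(nvme[0-9]+)c[0-9]+(n[0-9]+)$/\1\2/')
  if [ "$target" = "$dev" ]; then
    echo "unrecognized device name: $dev" >&2
    return 1
  fi
  printf 'ln -sfn /dev/%s /dev/%s\n' "$target" "$dev"
}

# Usage:
#   nvme_symlink_cmd nvme0c0n1
```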

[39437] Failure to replace a master node on a Container Cloud cluster¶

During the replacement of a master node on a cluster of any type, the process may get stuck with the Kubelet's NodeReady condition is Unknown message in the machine status on the remaining master nodes.

As a workaround, log in on the affected node and run the following command:

docker restart ucp-kubelet

[31186,34132] Pods get stuck during MariaDB operations¶

Due to an upstream MariaDB issue, during MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
Verify that other replicas are up and ready.
Remove the galera.cache file for the affected mariadb-server Pod.
Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.

[30294] Replacement of a master node is stuck on the calico-node Pod start¶

During replacement of a master node on a cluster of any type, the calico-node Pod fails to start on a new node that has the same IP address as the node being replaced.

Workaround:

From a CLI with an MKE client bundle, create a shell alias to start calicoctl using the mirantis/ucp-dsinfo image:

Since MKE 3.7.2

alias calicoctl="\
docker run -i --rm \
--pid host \
--net host \
-e constraint:ostype==linux \
-e ETCD_ENDPOINTS=<etcdEndpoint> \
-e ETCD_KEY_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/key.pem \
-e ETCD_CA_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/ca.pem \
-e ETCD_CERT_FILE=/var/lib/docker/volumes/ucp-kv-certs/_data/cert.pem \
-v /var/run/calico:/var/run/calico \
-v /var/lib/docker/volumes/ucp-kv-certs/_data:/var/lib/docker/volumes/ucp-kv-certs/_data:ro \
mirantis/ucp-dsinfo:<mkeVersion> \
calicoctl \
"

Before MKE 3.7.2

alias calicoctl="\
docker run -i --rm \
--pid host \
--net host \
-e constraint:ostype==linux \
-e ETCD_ENDPOINTS=<etcdEndpoint> \
-e ETCD_KEY_FILE=/ucp-node-certs/key.pem \
-e ETCD_CA_CERT_FILE=/ucp-node-certs/ca.pem \
-e ETCD_CERT_FILE=/ucp-node-certs/cert.pem \
-v /var/run/calico:/var/run/calico \
-v ucp-node-certs:/ucp-node-certs:ro \
mirantis/ucp-dsinfo:<mkeVersion> \
calicoctl --allow-version-mismatch \
"

In the above command, replace the following values with the corresponding settings of the affected cluster:

<etcdEndpoint> is the etcd endpoint defined in the Calico configuration file. For example, ETCD_ENDPOINTS=127.0.0.1:12378
<mkeVersion> is the MKE version installed on your cluster. For example, mirantis/ucp-dsinfo:3.5.7.

Verify the node list on the cluster:
```
kubectl get node
```
Compare this list with the node list in Calico to identify the old node:
```
calicoctl get node -o wide
```

Remove the old node from Calico:

calicoctl delete node kaas-node-<nodeID>
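
The comparison of the two node lists in the steps above can be sketched as follows. The stale_calico_nodes name and the file names are hypothetical, and the usage lines assume that `calicoctl get node -o wide` prints a header line followed by one node per line with the name in the first column.

```shell
# Hypothetical helper: prints the names listed in the Calico node file ($2)
# that are absent from the Kubernetes node file ($1), one name per line.
stale_calico_nodes() {
  grep -vxFf "$1" "$2"
}

# Usage (file names are placeholders; calicoctl is the alias created above):
#   kubectl get node --no-headers -o custom-columns=NAME:.metadata.name > k8s-nodes.txt
#   calicoctl get node -o wide | awk 'NR > 1 { print $1 }' > calico-nodes.txt
#   stale_calico_nodes k8s-nodes.txt calico-nodes.txt
```

Any name that the function prints is a candidate for the calicoctl delete node command; confirm it manually before deleting.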

[5782] Manager machine fails to be deployed during node replacement¶

During replacement of a manager machine, the following problems may occur:

The system adds the node to Docker swarm but not to Kubernetes
The node Deployment gets stuck with failed RethinkDB health checks

Workaround:

Delete the failed node.
Wait for the MKE cluster to become healthy. To monitor the cluster status:
1. Log in to the MKE web UI as described in Connect to the Mirantis Kubernetes Engine web UI.
2. Monitor the cluster status as described in MKE Operations Guide: Monitor an MKE cluster with the MKE web UI.
Deploy a new node.

[5568] The calico-kube-controllers Pod fails to clean up resources¶

During the unsafe or forced deletion of a manager machine running the calico-kube-controllers Pod in the kube-system namespace, the following issues occur:

The calico-kube-controllers Pod fails to clean up resources associated with the deleted node
The calico-node Pod may fail to start up on a newly created node if the machine is provisioned with the same IP address as the deleted machine had

As a workaround, before deletion of the node running the calico-kube-controllers Pod, cordon and drain the node:

kubectl cordon <nodeName>
kubectl drain <nodeName>

Ceph¶

[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶

Fixed in 2.27.0 (17.2.0 and 16.2.0)

During graceful reboot of a cluster with Ceph enabled, the reboot is blocked with the following message in the MiraCephMaintenance object status:

message: ClusterMaintenanceRequest found, Ceph Cluster is not ready to upgrade,
 delaying cluster maintenance

As a workaround, in the spec section of <kcc-name>.yaml in the Ceph cluster, add the following healthCheck snippet under metadataServer in the cephFS section:

cephClusterSpec:
  sharedFilesystem:
    cephFS:
    - name: cephfs-store
      metadataServer:
        activeCount: 1
        healthCheck:
          livenessProbe:
            probe:
              failureThreshold: 5
              initialDelaySeconds: 30
              periodSeconds: 30
              successThreshold: 1
              timeoutSeconds: 5

[26441] Cluster update fails with the MountDevice failed for volume warning¶

Update of a managed cluster based on bare metal with Ceph enabled fails with PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

Verify that the description of the Pods that failed to run contains the FailedMount events:
```
kubectl -n <affectedProjectName> describe pod <affectedPodName>
```
In the command above, replace the following values:
- <affectedProjectName> is the Container Cloud project name where the Pods failed to run
- <affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
1. Identify csiPodName of the corresponding csi-rbdplugin:
```
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
```
2. Output the affected csiPodName logs:
```
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
```
Scale down the StatefulSet or Deployment of the failing Pod to 0 replicas.

On every csi-rbdplugin Pod, search for stuck csi-vol:

for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done

Unmap the affected csi-vol:
```
rbd unmap -o force /dev/rbd<i>
```
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete the volumeattachment of the affected Pod:

kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattachment <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.
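
To identify the /dev/rbd<i> value for the unmap step, the output of rbd device list can be filtered by the volume name. The sketch below assumes that the device path is the last column of that output; the rbd_devices_for_volume name is hypothetical.

```shell
# Hypothetical helper: reads `rbd device list` output on stdin and prints the
# device path (assumed to be the last column) of every mapping whose line
# contains the given volume name.
rbd_devices_for_volume() {
  awk -v vol="$1" 'index($0, vol) > 0 { print $NF }'
}

# Usage (inside the loop above, in place of the grep):
#   kubectl exec -n rook-ceph <csiPodName> -c csi-rbdplugin -- rbd device list \
#     | rbd_devices_for_volume <csi-vol-uuid>
```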

StackLight¶

[42304] Failure of shard relocation in the OpenSearch cluster¶

Fixed in 17.2.0, 16.2.0, 17.1.6, 16.1.6

On large managed clusters, shard relocation may fail in the OpenSearch cluster with the yellow or red status of the OpenSearch cluster. The characteristic symptom of the issue is that in the stacklight namespace, the statefulset.apps/opensearch-master containers are experiencing throttling with the KubeContainersCPUThrottlingHigh alert firing for the following set of labels:

{created_by_kind="StatefulSet",created_by_name="opensearch-master",namespace="stacklight"}

Caution

The throttling that OpenSearch is experiencing may be temporary and related, for example, to a peaky load or to ongoing shard initialization as part of disaster recovery or after a node restart. In this case, Mirantis recommends waiting until the initialization of all shards is finished and then verifying the cluster state and whether throttling still exists. Apply the workaround below only if throttling persists.

To verify that the initialization of shards is ongoing:

kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash

curl "http://localhost:9200/_cat/shards" | grep INITIALIZING

Example of system response:

.ds-system-000072    2 r INITIALIZING    10.232.182.135 opensearch-master-1
.ds-system-000073    1 r INITIALIZING    10.232.7.145   opensearch-master-2
.ds-system-000073    2 r INITIALIZING    10.232.182.135 opensearch-master-1
.ds-audit-000001     2 r INITIALIZING    10.232.7.145   opensearch-master-2

The system response above indicates that shards from the .ds-system-000072, .ds-system-000073, and .ds-audit-000001 indices are in the INITIALIZING state. In this case, Mirantis recommends waiting until this process is finished, and only then consider changing the limit.
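
On clusters with many shards, a per-index summary is easier to follow than the raw list. The following sketch assumes the column layout of the example response, where the shard state is the fourth field; the initializing_shards_by_index name is hypothetical.

```shell
# Hypothetical helper: reads `_cat/shards` output on stdin and prints, per
# index, how many shards are in the INITIALIZING state.
initializing_shards_by_index() {
  awk '$4 == "INITIALIZING" { count[$1]++ }
       END { for (idx in count) print idx, count[idx] }' | sort
}

# Usage (inside the opensearch container):
#   curl -s "http://localhost:9200/_cat/shards" | initializing_shards_by_index
```

Empty output means that no shards are initializing.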

You can additionally analyze the exact level of throttling and the current CPU usage on the Kubernetes Containers dashboard in Grafana.

Workaround:

Verify the currently configured CPU requests and limits for the opensearch containers:

kubectl -n stacklight get statefulset.apps/opensearch-master -o jsonpath="{.spec.template.spec.containers[?(@.name=='opensearch')].resources}"

Example of system response:

{"limits":{"cpu":"600m","memory":"8Gi"},"requests":{"cpu":"500m","memory":"6Gi"}}

In the example above, the CPU request is 500m and the CPU limit is 600m.

Increase the CPU limit to a reasonably high number.

For example, the default CPU limit for the clusters with the clusterSize:large parameter set was increased from 8000m to 12000m for StackLight in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0).

Note

For details on the clusterSize parameter, see Operations Guide: StackLight configuration parameters - Cluster size.

If the defaults are already overridden on the affected cluster using the resourcesPerClusterSize or resources parameters as described in Operations Guide: StackLight configuration parameters - Resource limits, then the exact recommended number depends on the currently set limit.

Mirantis recommends increasing the limit by 50%. If it does not resolve the issue, another increase iteration will be required.
When you select the required CPU limit, increase it as described in Operations Guide: StackLight configuration parameters - Resource limits.

If the CPU limit for the opensearch component is already set, increase it in the Cluster object for the opensearch parameter. Otherwise, the default StackLight limit is used. In this case, increase the CPU limit for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU limits and become running and ready.

To verify the current CPU limit for every opensearch container in every opensearch-master pod separately:
```
kubectl -n stacklight get pod/opensearch-master-<podSuffixNumber> -o jsonpath="{.spec.containers[?(@.name=='opensearch')].resources}"
```
In the command above, replace <podSuffixNumber> with the pod suffix number. For example, pod/opensearch-master-0 or pod/opensearch-master-2.

Example of system response:
```
{"limits":{"cpu":"900m","memory":"8Gi"},"requests":{"cpu":"500m","memory":"6Gi"}}
```
The wait may take up to 20 minutes depending on the cluster size.

If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops firing immediately, while OpenSearchClusterStatusWarning or OpenSearchClusterStatusCritical can still be firing for some time during shard relocation.

If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with another iteration of the CPU limit increase.
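
The recommended 50% increase can be computed for each iteration with a small helper. This is a sketch with a hypothetical next_cpu_limit name; it accepts only integer values in millicores (for example, 600m) or whole cores (for example, 8), and it rounds down to a whole millicore.

```shell
# Hypothetical helper: prints the given CPU limit increased by 50%,
# expressed in millicores. Fractional values such as 1.5 are not supported.
next_cpu_limit() {
  case "$1" in
    ''|m|*[!0-9m]*)
      echo "unsupported CPU value: $1" >&2
      return 1
      ;;
    *m) milli=${1%m} ;;
    *)  milli=$(( $1 * 1000 )) ;;
  esac
  echo "$(( milli * 3 / 2 ))m"
}

# Usage:
#   next_cpu_limit 600m
```

With the limits quoted in this section, 600m becomes 900m and 8000m becomes 12000m, which matches the example response and the default change mentioned above.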

[40020] Rollover policy update is not applied to the current index¶

Fixed in 17.2.0, 16.2.0, 17.1.6, 16.1.6

While updating rollover_policy for the current system* and audit* data streams, the update is not applied to indices.

One of the indicators that the cluster is most likely affected is the KubeJobFailed alert firing for the elasticsearch-curator job and one or both of the following errors being present in elasticsearch-curator pods that remain in the Error status:

2024-05-31 13:16:04,459 ERROR   Failed to complete action: delete_indices.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')

2024-05-31 13:16:04,459 ERROR   Failed to complete action: delete_indices.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')

Note

Instead of .ds-audit-000001 or .ds-system-000001 index names, similar names can be present with the same prefix but different suffix numbers.

If the above-mentioned alert and errors are present, immediate action is required because they indicate that the corresponding index size has already exceeded the space allocated for the index.

To verify that the cluster is affected:

Caution

Verify and apply the workaround to both index patterns, system and audit, separately.

If one of the indices is affected, the second one is most likely affected as well, although in rare cases only one index may be affected.

Log in to the opensearch-master-0 Pod:

kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash

Verify that the rollover policy is present:
- system:
```
curl localhost:9200/_plugins/_ism/policies/system_rollover_policy
```
- audit:
```
curl localhost:9200/_plugins/_ism/policies/audit_rollover_policy
```
The cluster is affected if the rollover policy is missing. Otherwise, proceed to the following step.

Verify the system response from the previous step. For example:

{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}

Verify and capture the following items separately for every policy:

The _seq_no and _primary_term values
The rollover policy threshold, which is defined in policy.states[0].actions[0].rollover.min_size

List indices:

system:

curl localhost:9200/_cat/indices | grep system

Example of system response:

[...]
green open .ds-system-000001   FjglnZlcTKKfKNbosaE9Aw 2 1 1998295  0   1gb 507.9mb

audit:

curl localhost:9200/_cat/indices | grep audit

Example of system response:

[...]
green open .ds-audit-000001   FjglnZlcTKKfKNbosaE9Aw 2 1 1998295  0   1gb 507.9mb

Select the index with the highest number and verify the rollover policy attached to the index:
- system:
```
curl localhost:9200/_plugins/_ism/explain/.ds-system-000001
```
- audit:
```
curl localhost:9200/_plugins/_ism/explain/.ds-audit-000001
```
- If the rollover policy is not attached, the cluster is affected.
- If the rollover policy is attached but _seq_no and _primary_term numbers do not match the previously captured ones, the cluster is affected.
- If the index size drastically exceeds the defined threshold of the rollover policy (which is the previously captured min_size), the cluster is most probably affected.
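
The values captured during this verification can be extracted from the responses with standard text tools. The sketch below avoids JSON parsers because their presence in the container is not guaranteed; it assumes the compact single-line JSON and the `_cat/indices` column layout shown in the examples above. The policy_fingerprint and latest_index names are hypothetical.

```shell
# Hypothetical helper: reads a rollover policy response on stdin and prints
# its _seq_no, _primary_term, and rollover min_size on one line.
policy_fingerprint() {
  json=$(cat)
  seq_no=$(printf '%s' "$json" | grep -o '"_seq_no":[0-9]*' | head -n 1 | cut -d: -f2)
  term=$(printf '%s' "$json" | grep -o '"_primary_term":[0-9]*' | head -n 1 | cut -d: -f2)
  min_size=$(printf '%s' "$json" | grep -o '"min_size":"[^"]*"' | head -n 1 | cut -d'"' -f4)
  printf 'seq_no=%s primary_term=%s min_size=%s\n' "$seq_no" "$term" "$min_size"
}

# Hypothetical helper: reads `_cat/indices` output on stdin and prints the
# data stream index with the highest number (third column, names in .ds-*).
latest_index() {
  awk '$3 ~ /^\.ds-/ { print $3 }' | sort | tail -n 1
}

# Usage (inside the opensearch container):
#   curl -s localhost:9200/_plugins/_ism/policies/system_rollover_policy | policy_fingerprint
#   curl -s localhost:9200/_cat/indices | grep system | latest_index
```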

Workaround:

Log in to the opensearch-master-0 Pod:

kubectl exec -it pod/opensearch-master-0 -n stacklight -c opensearch -- bash

If the policy is attached to the index but has different _seq_no and _primary_term, remove the policy from the index:

Note

Use the index with the highest number in the name, which was captured during the verification procedure.
- system:
```
curl -XPOST localhost:9200/_plugins/_ism/remove/.ds-system-000001
```
- audit:
```
curl -XPOST localhost:9200/_plugins/_ism/remove/.ds-audit-000001
```

Re-add the policy:

system:

curl -XPOST -H "Content-type: application/json" "localhost:9200/_plugins/_ism/add/system*" -d'{"policy_id":"system_rollover_policy"}'

audit:

curl -XPOST -H "Content-type: application/json" "localhost:9200/_plugins/_ism/add/audit*" -d'{"policy_id":"audit_rollover_policy"}'

Repeat the last step of the cluster verification procedure above and make sure that the policy is attached to the index and has the same _seq_no and _primary_term.

If the index size drastically exceeds the defined threshold of the rollover policy (which is the previously captured min_size), wait up to 15 minutes and verify that the additional index is created with the consecutive number in the index name. For example:
- system: if you applied changes to .ds-system-000001, wait until .ds-system-000002 is created.
- audit: if you applied changes to .ds-audit-000001, wait until .ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
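
The name of the index to wait for can be derived from the current one. This sketch assumes the six-digit zero-padded suffix used in the examples above; the next_index_name name is hypothetical.

```shell
# Hypothetical helper: prints the index name with the consecutive number,
# for example .ds-system-000001 -> .ds-system-000002.
# Assumes a six-digit zero-padded numeric suffix after the last hyphen.
next_index_name() {
  prefix=${1%-*}
  digits=${1##*-}
  # Strip leading zeros so that the shell does not treat the value as octal.
  number=$(printf '%s' "$digits" | sed 's/^0*//')
  printf '%s-%06d\n' "$prefix" "$(( ${number:-0} + 1 ))"
}

# Usage (inside the opensearch container):
#   curl -s localhost:9200/_cat/indices | grep "$(next_index_name .ds-system-000001)"
```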

Update¶

[36928] The helm-controller Deployment is stuck during cluster update¶

During a cluster update, a Kubernetes helm-controller Deployment may get stuck in a restarting Pod loop with Terminating and Running states flapping. Other Deployment types may also be affected.

As a workaround, restart the Deployment that got stuck:

kubectl -n <affectedProjectName> get deploy <affectedDeployName> -o yaml

kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas 0

kubectl -n <affectedProjectName> scale deploy <affectedDeployName> --replicas <replicasNumber>

In the commands above, replace the following values:

<affectedProjectName> is the Container Cloud project name containing the cluster with stuck Pods
<affectedDeployName> is the Deployment name that failed to run Pods in the specified project
<replicasNumber> is the original number of replicas for the Deployment that you can obtain using the get deploy command
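
The restart sequence can be combined into one function that records the replica count before scaling down. This is a sketch only: the restart_deployment name and the KUBECTL override variable are hypothetical, the latter included so that the function can be dry-run with a stub. It reads the count from .spec.replicas instead of the full YAML output.

```shell
# Hypothetical helper: scales a Deployment to 0 replicas and then back to the
# replica count it had before the restart.
# KUBECTL can point to an alternative binary or a stub for dry runs.
restart_deployment() {
  ns="$1"
  name="$2"
  kubectl_bin="${KUBECTL:-kubectl}"
  replicas=$("$kubectl_bin" -n "$ns" get deploy "$name" -o jsonpath='{.spec.replicas}') || return 1
  # Refuse to continue if the original replica count could not be read.
  [ -n "$replicas" ] || return 1
  "$kubectl_bin" -n "$ns" scale deploy "$name" --replicas 0 || return 1
  "$kubectl_bin" -n "$ns" scale deploy "$name" --replicas "$replicas"
}

# Usage:
#   restart_deployment <affectedProjectName> <affectedDeployName>
```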