Known issues¶

This section lists known issues with workarounds for the Mirantis Container Cloud release 2.20.0 including the Cluster releases 11.4.0 and 7.10.0.

For other issues that can occur while deploying and operating a Container Cloud cluster, see Deployment Guide: Troubleshooting and Operations Guide: Troubleshooting.

Note

This section also outlines still valid known issues from previous Container Cloud releases.

MKE
Bare metal
Equinix Metal with private networking
vSphere

StackLight
Ceph
Management cluster upgrade
Container Cloud web UI

MKE¶

[20651] A cluster deployment or update fails with not ready compose deployments¶

A managed cluster deployment, attachment, or update to a Cluster release with MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the compose pods flapping (ready > terminating > pending) and with the following error message appearing in logs:

'not ready: deployments: kube-system/compose got 0/0 replicas, kube-system/compose-api
 got 0/0 replicas'
 ready: false
 type: Kubernetes

Workaround:

Disable Docker Content Trust (DCT):
1. Access the MKE web UI as admin.
2. Navigate to Admin > Admin Settings.
3. In the left navigation pane, click Docker Content Trust and disable it.
Restart the affected deployments such as calico-kube-controllers, compose, compose-api, coredns, and so on:
```
kubectl -n kube-system delete deployment <deploymentName>
```
Once done, the cluster deployment or update resumes.
Re-enable DCT.

Bare metal¶

[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods¶

Fixed in 11.6.0

Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with mcc-cache Pods being stuck in the CrashLoopBackOff status of restarts.

As a workaround, remove failed mcc-cache Pods to restart them automatically. For example:

kubectl -n kaas delete pod mcc-cache-0

[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶

During deletion of a manager machine running the ironic Pod from a bare metal management cluster, the following problems occur:

All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state

As a workaround, before deletion of the node running the ironic Pod, cordon and drain the node using the kubectl cordon <nodeName> and kubectl drain <nodeName> commands.

[20736] Region deletion failure after regional deployment failure¶

If a baremetal-based regional cluster deployment fails before pivoting is done, the corresponding region deletion fails.

Workaround:

Using the command below, manually delete all possible traces of the failed regional cluster deployment, including but not limited to the following objects that contain the kaas.mirantis.com/region label of the affected region:

cluster
machine
baremetalhost
baremetalhostprofile
l2template
subnet
ipamhost
ipaddr

kubectl delete <objectName> -l kaas.mirantis.com/region=<regionName>

Warning

Do not use the same region name again after the regional cluster deployment failure since some objects that reference the region name may still exist.

Equinix Metal with private networking¶

[26659] Regional cluster deployment failure with stuck ‘mcc-cache’ Pods¶

Fixed in 11.6.0

Deployment of a regional cluster based on bare metal or Equinix Metal with private networking fails with mcc-cache Pods being stuck in the CrashLoopBackOff status of restarts.

As a workaround, remove failed mcc-cache Pods to restart them automatically. For example:

kubectl -n kaas delete pod mcc-cache-0

vSphere¶

[26070] RHEL system cannot be registered in Red Hat portal over MITM proxy¶

Deployment of RHEL machines using the Red Hat portal registration, which requires user and password credentials, over MITM proxy fails while building the virtual machines template with the following error:

Unable to verify server's identity: [SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed (_ssl.c:618)

The Container Cloud deployment gets stuck while applying the RHEL license to machines with the same error in the lcm-agent logs.

As a workaround, use the internal Red Hat Satellite server that a VM can access directly without a MITM proxy.

StackLight¶

[28526] CPU throttling for ‘kaas-exporter’ blocking metric collection¶

Fixed in 11.6.0 and 12.7.0

A low CPU limit 100m for kaas-exporter blocks metric collection.

As a workaround, increase the CPU limit for kaas-exporter to 500m on the management cluster in the spec:providerSpec:value:kaas:management:helmReleases: section as described in {{ mos_name_abbr }} documentation: Underlay Kubernetes operations - Increase memory limits for cluster components.

[27732-1] OpenSearch PVC size custom settings are dismissed during deployment¶

Fixed in 11.6.0 and 12.7.0

The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is overwritten by logging.persistentVolumeClaimSize during deployment of a Container Cloud cluster of any type and is set to the default 30Gi.

Note

This issue does not block the OpenSearch cluster operations if the default retention time is set. The default setting is usually enough for the capacity size of this cluster.

The issue may affect the following Cluster releases:

11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)

To verify that the cluster is affected:

Note

In the commands below, substitute parameters enclosed in angle brackets to match the affected cluster values.

kubectl --kubeconfig=<managementClusterKubeconfigPath> \
-n <affectedClusterProjectName> \
get cluster <affectedClusterName> \
-o=jsonpath='{.spec.providerSpec.value.helmReleases[*].values.elasticsearch.persistentVolumeClaimSize}' | xargs echo config size:


kubectl --kubeconfig=<affectedClusterKubeconfigPath> \
-n stacklight get pvc -l 'app=opensearch-master' \
-o=jsonpath="{.items[*].status.capacity.storage}" | xargs echo capacity sizes:

The cluster is not affected if the configuration size value matches or is less than any capacity size. For example:

config size: 30Gi
capacity sizes: 30Gi 30Gi 30Gi

config size: 50Gi
capacity sizes: 100Gi 100Gi 100Gi

The cluster is affected if the configuration size is larger than any capacity size. For example:
```
config size: 200Gi
capacity sizes: 100Gi 100Gi 100Gi
```

Workaround for a new cluster creation:

Select from the following options:
- For a management or regional cluster, during the bootstrap procedure, open cluster.yaml.template for editing.
- For a managed cluster, open the Cluster object for editing.
  
  Caution
  
  For a managed cluster, use the Container Cloud API instead of the web UI for cluster creation.

In the opened .yaml file, add logging.persistentVolumeClaimSize along with elasticsearch.persistentVolumeClaimSize. For example:

apiVersion: cluster.k8s.io/v1alpha1
spec:
...
  providerSpec:
    value:
    ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            enabled: true
            persistentVolumeClaimSize: 100Gi

Continue the cluster deployment. The system will use the custom value set in logging.persistentVolumeClaimSize.

Caution

If elasticsearch.persistentVolumeClaimSize is absent in the .yaml file, the Admission Controller blocks the configuration update.

Workaround for an existing cluster:

Caution

During the application of the below workarounds, a short outage of OpenSearch and its dependent components may occur with the following alerts firing on the cluster. This behavior is expected. Therefore, disregard these alerts.

StackLight alerts list firing during cluster update

Cluster size and outage probability level	Alert name	Label name and component
Any cluster with high probability	`KubeStatefulSetOutage`	`statefulset=opensearch-master`
	`KubeDeploymentOutage`	`deployment=opensearch-dashboards` `deployment=metricbeat`
Large cluster with average probability	`KubePodsNotReady` ^{Removed in 17.0.0, 16.0.0, and 14.1.0}	`created_by_name="opensearch-master"` `created_by_name="opensearch-dashboards"` `created_by_name="metricbeat-*"`
	`OpenSearchClusterStatusWarning`	n/a
	`OpenSearchNumberOfPendingTasks`	n/a
	`OpenSearchNumberOfInitializingShards`	n/a
	`OpenSearchNumberOfUnassignedShards` ^{Removed in 2.27.0 (17.2.0 and 16.2.0)}	n/a
Any cluster with low probability	`KubeStatefulSetReplicasMismatch`	`statefulset=opensearch-master`
	`KubeDeploymentReplicasMismatch`	`deployment=opensearch-dashboards` `deployment=metricbeat`

StackLight in non-HA mode with a non-expandable StorageClass and no LVP for OpenSearch PVCs

Warning

After applying this workaround, the existing log data will be lost. Depending on your custom provisioner, you may find a third-party tool, such as pv-migrate, that provides a possibility to copy all data from one PV to another.

If data loss is acceptable, proceed with the workaround below.

Note

To verify whether a StorageClass is expandable:

kubectl -n stacklight get pvc | grep opensearch-master | awk '{print $6}' | xargs -I{} kubectl get storageclass {} -o yaml | grep 'allowVolumeExpansion: true'

A positive system response is allowVolumeExpansion: true. A negative system response is blank or false.

Scale down the opensearch-master StatefulSet with dependent resources to 0 and disable the elasticsearch-curator CronJob:

kubectl -n stacklight scale --replicas 0 statefulset opensearch-master

kubectl -n stacklight scale --replicas 0 deployment opensearch-dashboards

kubectl -n stacklight scale --replicas 0 deployment metricbeat

kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : true }}'

Recreate the opensearch-master StatefulSet with the updated disk size:

kubectl get statefulset opensearch-master -o yaml -n stacklight | sed 's/storage: 30Gi/storage: <<pvc_size>>/g' > opensearch-master.yaml

kubectl -n stacklight delete statefulset opensearch-master

kubectl create -f opensearch-master.yaml

Replace <pvcSize> with the elasticsearch.persistentVolumeClaimSize value.

Delete existing PVCs:
```
kubectl delete pvc -l 'app=opensearch-master' -n stacklight
```
Warning

This command removes all existing logs data from PVCs.

In the Cluster configuration, set logging.persistentVolumeClaimSize to the same value as the size of the elasticsearch.persistentVolumeClaimSize parameter. For example:

 apiVersion: cluster.k8s.io/v1alpha1
 kind: Cluster
 spec:
 ...
   providerSpec:
     value:
     ...
       helmReleases:
       - name: stacklight
         values:
           elasticsearch:
             persistentVolumeClaimSize: 100Gi
           logging:
             enabled: true
             persistentVolumeClaimSize: 100Gi

Scale up the opensearch-master StatefulSet with dependent resources to 1 and enable the elasticsearch-curator CronJob:

kubectl -n stacklight scale --replicas 1 statefulset opensearch-master

sleep 100

kubectl -n stacklight scale --replicas 1 deployment opensearch-dashboards

kubectl -n stacklight scale --replicas 1 deployment metricbeat

kubectl -n stacklight patch cronjobs elasticsearch-curator -p '{"spec" : {"suspend" : false }}'

[27732-2] Custom settings for ‘elasticsearch.logstashRetentionTime’ are dismissed¶

Fixed in 11.6.0 and 12.7.0

Custom settings for the deprecated elasticsearch.logstashRetentionTime parameter are overwritten by the default setting set to 1 day.

The issue may affect the following Cluster releases with enabled elasticsearch.logstashRetentionTime:

11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)

As a workaround, in the Cluster object, replace elasticsearch.logstashRetentionTime with elasticsearch.retentionTime that was implemented to replace the deprecated parameter. For example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
spec:
  ...
  providerSpec:
    value:
    ...
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            retentionTime:
              logstash: 10
              events: 10
              notifications: 10
          logging:
            enabled: true

For the StackLight configuration procedure and parameters description, refer to Configure StackLight.

[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error¶

Note

Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshoot StackLight.

On a managed cluster, the StackLight pods may get stuck with the Pod predicate NodeAffinity failed error in the pod status. The issue may occur if the StackLight node label was added to one machine and then removed from another one.

The issue does not affect the StackLight services, all required StackLight pods migrate successfully except extra pods that are created and stuck during pod migration.

As a workaround, remove the stuck pods:

kubectl --kubeconfig <managedClusterKubeconfig> -n stacklight delete pod <stuckPodName>

Ceph¶

[26820] ‘KaaSCephCluster’ does not reflect issues during Ceph cluster deletion¶

Fixed in 2.22.0

The status section in the KaaSCephCluster.status CR does not reflect issues during the process of a Ceph cluster deletion.

As a workaround, inspect Ceph Controller logs on the managed cluster:

kubectl --kubeconfig <managedClusterKubeconfig> -n ceph-lcm-mirantis logs <ceph-controller-pod-name>

[26441] Cluster update fails with the MountDevice failed for volume warning¶

Update of a managed cluster based on bare metal and Ceph enabled fails with PersistentVolumeClaim getting stuck in the Pending state for the prometheus-server StatefulSet and the MountVolume.MountDevice failed for volume warning in the StackLight event logs.

Workaround:

Verify that the description of the Pods that failed to run contain the FailedMount events:
```
kubectl -n <affectedProjectName> describe pod <affectedPodName>
```
In the command above, replace the following values:
- <affectedProjectName> is the Container Cloud project name where the Pods failed to run
- <affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the rbd volume mount failed: <csi-vol-uuid> is being used error. The <csi-vol-uuid> is a unique RBD volume name.
1. Identify csiPodName of the corresponding csi-rbdplugin:
```
kubectl -n rook-ceph get pod -l app=csi-rbdplugin \
-o jsonpath='{.items[?(@.spec.nodeName == "<nodeName>")].metadata.name}'
```
2. Output the affected csiPodName logs:
```
kubectl -n rook-ceph logs <csiPodName> -c csi-rbdplugin
```
Scale down the affected StatefulSet or Deployment of the Pod that fails to 0 replicas.

On every csi-rbdplugin Pod, search for stuck csi-vol:

for pod in `kubectl -n rook-ceph get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do
  echo $pod
  kubectl exec -it -n rook-ceph $pod -c csi-rbdplugin -- rbd device list | grep <csi-vol-uuid>
done

Unmap the affected csi-vol:
```
rbd unmap -o force /dev/rbd<i>
```
The /dev/rbd<i> value is a mapped RBD volume that uses csi-vol.

Delete volumeattachment of the affected Pod:

kubectl get volumeattachments | grep <csi-vol-uuid>
kubectl delete volumeattacmhent <id>

Scale up the affected StatefulSet or Deployment back to the original number of replicas and wait until its state becomes Running.

Management cluster upgrade¶

[26740] Failure to upgrade a management cluster with a custom certificate¶

Fixed in 2.21.0

An upgrade of a Container Cloud management cluster with a custom Keycloak or web UI TLS certificate fails with the following example error:

failed to update management cluster: \
admission webhook "validations.kaas.mirantis.com" denied the request: \
failed to validate TLS spec for Cluster 'default/kaas-mgmt': \
desired hostname is not set for 'ui'

Workaround:

Verify that the tls section of the management cluster contains the hostname and certificate fields for configured applications:

Open the management Cluster object for editing:
```
kubectl edit cluster <mgmtClusterName>
```

Verify that the tls section contains the following fields:

tls:
  keycloak:
    certificate:
      name: keycloak
    hostname: <keycloakHostName>
    tlsConfigRef: “” or “keycloak”
  ui:
    certificate:
      name: ui
    hostname: <webUIHostName>
    tlsConfigRef: “” or “ui”

Container Cloud web UI¶

[26416] Failure to upload an MKE client bundle during cluster attachment¶

Fixed in 7.11.0, 11.5.0 and 12.5.0

During attachment of an existing MKE cluster using the Container Cloud web UI, uploading of an MKE client bundle fails with a false-positive message about a successful uploading.

Workaround:

Select from the following options:

Fill in the required fields for the MKE client bundle manually.
In the Attach Existing MKE Cluster window, use upload MKE client bundle twice to upload ucp.bundle-admin.zip and ucp-docker-bundle.zip located in the first archive.

[23002] Inability to set a custom value for a predefined node label¶

Fixed in 7.11.0, 11.5.0 and 12.5.0

During machine creation using the Container Cloud web UI, a custom value for a node label cannot be set.

As a workaround, manually add the value to spec.providerSpec.value.nodeLabels in machine.yaml.