Known issues¶
This section lists known issues with workarounds for the Mirantis Container Cloud release 2.5.0 including the Cluster releases 5.12.0 and 6.12.0.
Note
This section also outlines still valid known issues from previous Container Cloud releases.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments. Moving forward, the workaround for this issue will be moved from Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state with the pod has unbound immediate PersistentVolumeClaims and node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data can be rebuilt from replicas. If you have a non-HA deployment, back up any existing data before proceeding, since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts of the affected pods:
kubectl get pod/<pod_name1> pod/<pod_name2> \
  -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}'
Note
In the command above and in the subsequent steps, substitute the parameters enclosed in angle brackets with the corresponding values.
Delete the affected Pods and PersistentVolumeClaims to reschedule them. For example, for StackLight:
kubectl -n stacklight delete \
  pod/<pod_name1> pod/<pod_name2> ... \
  pvc/<pvc_name1> pvc/<pvc_name2> ...
vSphere¶
[11633] A vSphere-based project cannot be cleaned up¶
Fixed in Container Cloud 2.6.0
A vSphere-based managed cluster project can fail to be cleaned up because of stale secret(s) related to the RHEL license object(s). Before you can successfully clean up such a project, manually delete the secret using the steps below.
Workaround:
Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.
Obtain the list of stale secrets:
kubectl --kubeconfig <kubeconfigPath> get secrets -n <projectName>
Open each secret for editing:
kubectl --kubeconfig <kubeconfigPath> edit secret <secretName> -n <projectName>
Remove the following lines:
finalizers:
- kaas.mirantis.com/credentials-secret
Remove stale secrets:
kubectl --kubeconfig <kubeconfigPath> delete secret <secretName> -n <projectName>
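The manual edit above removes the stale finalizer so that Kubernetes can garbage-collect the secret. A sketch of the same operation applied to a secret manifest loaded as a dictionary (the secret name is a hypothetical example):

```python
FINALIZER = "kaas.mirantis.com/credentials-secret"

def strip_finalizer(secret):
    """Drop the stale finalizer from a secret manifest (as a dict),
    leaving any other finalizers in place."""
    meta = secret.setdefault("metadata", {})
    meta["finalizers"] = [f for f in meta.get("finalizers", []) if f != FINALIZER]
    return secret

secret = {"metadata": {"name": "rhel-license-example", "finalizers": [FINALIZER]}}
cleaned = strip_finalizer(secret)
```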
Bare metal¶
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed successfully but with runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because of incorrect network configuration, verify the status of the corresponding IpamHost object. Inspect the l2RenderResult and ipAllocationResult object fields for error messages.
[9875] Full preflight fails with a timeout waiting for BareMetalHost¶
Fixed in Container Cloud 2.6.0
If you run bootstrap.sh preflight with KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
failed to create BareMetal objects: failed to wait for objects of kinds BareMetalHost
to become available: timed out waiting for the condition
As a workaround, run unset KAAS_BM_FULL_PREFLIGHT to disable full preflight and run fast preflight instead.
[11468] Pods using LVP PV are not mounted to LVP disk¶
Fixed in Container Cloud 2.6.0
The persistent volumes (PVs) that are created using the local volume provisioner (LVP) are not mounted on the dedicated disk labeled as local-volume and use the root volume instead. In the workaround below, we use StackLight volumes as an example.
Workaround:
Identify whether your cluster is affected:
Log in to any control plane node on the management cluster.
Run the following command:
findmnt /mnt/local-volumes/stacklight/elasticsearch-data/vol00
In the output, inspect the SOURCE column. If the path starts with /dev/mapper/lvm_root-root, the host is affected by the issue.
Example of system response:
TARGET                                                 SOURCE                                                                                FSTYPE OPTIONS
/mnt/local-volumes/stacklight/elasticsearch-data/vol00 /dev/mapper/lvm_root-root[/var/lib/local-volumes/stacklight/elasticsearch-data/vol00] ext4   rw,relatime,errors=remount-ro,data=ordered
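The affected-host check above boils down to parsing the findmnt output and testing the SOURCE column. A sketch of that logic (a hypothetical helper, not part of the product):

```python
def is_affected(findmnt_output):
    """Return True if the SOURCE column of a `findmnt <target>` output
    points at the root LVM volume instead of the dedicated LVP disk."""
    lines = findmnt_output.strip().splitlines()
    if len(lines) < 2:
        return False
    source = lines[1].split()[1]  # columns: TARGET SOURCE FSTYPE OPTIONS
    return source.startswith("/dev/mapper/lvm_root-root")

affected_sample = (
    "TARGET SOURCE FSTYPE OPTIONS\n"
    "/mnt/local-volumes/stacklight/elasticsearch-data/vol00 "
    "/dev/mapper/lvm_root-root[/var/lib/local-volumes/stacklight/elasticsearch-data/vol00] "
    "ext4 rw,relatime\n"
)
```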
Verify other StackLight directories by replacing elasticsearch-data in the command above with the corresponding folder names.
If your cluster is affected, follow the steps below to manually move all data for volumes that must be on the dedicated disk to the mounted device.
Identify all nodes that run the elasticsearch-master pod:
kubectl -n stacklight get pods -o wide | grep elasticsearch-master
Apply the steps below to all nodes provided in the output.
Identify the mount point for the dedicated device /dev/mapper/lvm_lvp-lvp. Typically, this device is mounted as /mnt/local-volumes.
findmnt /mnt/local-volumes
Verify that SOURCE for the /mnt/local-volumes mount target is /dev/mapper/lvm_lvp-lvp on all the nodes.
Create new source directories for the volumes on the dedicated device /dev/mapper/lvm_lvp-lvp:
mkdir -p /mnt/local-volumes/src/stacklight/elasticsearch-data/vol00
Stop the pods that use the volumes to ensure that the data is not corrupted during the switch. Set the number of replicas in StatefulSet to 0:
kubectl -n stacklight edit statefulset elasticsearch-master
Wait until all elasticsearch-master pods are stopped.
Move the Elasticsearch data from the current location to the new directory:
cp -pR /var/lib/local-volumes/stacklight/elasticsearch-data/vol00/** /mnt/local-volumes/src/stacklight/elasticsearch-data/vol00/
Unmount the old source directory from the volume mount point:
umount /mnt/local-volumes/stacklight/elasticsearch-data/vol00
Apply this step and the next one to every node with the /mnt/local-volumes/stacklight/elasticsearch-data/vol00 volume.
Remount the new source directory to the volume mount point:
mount --bind /mnt/local-volumes/src/stacklight/elasticsearch-data/vol00 /mnt/local-volumes/stacklight/elasticsearch-data/vol00
Edit the Cluster object by adding the following parameters for the StackLight Helm chart:
kubectl --kubeconfig <mgmtClusterKubeconfig> edit -n <projectName> cluster <managedClusterName>
spec:
  helmReleases:
  - name: stacklight
    values:
      ...
      elasticsearch:
        clusterHealthCheckParams: wait_for_status=red&timeout=1s
Start the Elasticsearch pods by setting the number of replicas in StatefulSet to 3:
kubectl -n stacklight edit statefulset elasticsearch-master
Wait until all elasticsearch-master pods are up and running.
Remove the previously added clusterHealthCheckParams parameters from the Cluster object.
In /etc/fstab on every node that has the volume /mnt/local-volumes/stacklight/elasticsearch-data/vol00, edit the following entry:
/var/lib/local-volumes/stacklight/elasticsearch-data/vol00 /mnt/local-volumes/stacklight/elasticsearch-data/vol00 none bind 0 0
In this entry, replace the old directory /var/lib/local-volumes/stacklight/elasticsearch-data/vol00 with the new one: /mnt/local-volumes/src/stacklight/elasticsearch-data/vol00.
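The final /etc/fstab edit can be sketched as a simple one-shot rewrite of the bind-mount source (a hypothetical helper; in practice you edit the file by hand on each node):

```python
OLD_SRC = "/var/lib/local-volumes/stacklight/elasticsearch-data/vol00"
NEW_SRC = "/mnt/local-volumes/src/stacklight/elasticsearch-data/vol00"

def rewrite_fstab(lines):
    """Replace the old bind source with the new directory on the dedicated
    device, leaving all other fstab entries untouched."""
    return [
        line.replace(OLD_SRC, NEW_SRC, 1) if line.startswith(OLD_SRC) else line
        for line in lines
    ]

entry = (OLD_SRC +
         " /mnt/local-volumes/stacklight/elasticsearch-data/vol00 none bind 0 0")
```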
Storage¶
[10060] Ceph OSD node removal fails¶
Fixed in Container Cloud 2.7.0
A Ceph node removal is not being triggered properly after updating
the KaasCephCluster
custom resource (CR). Both management and managed
clusters are affected.
Workaround:
Remove the parameters for a Ceph OSD from the KaasCephCluster CR as described in Operations Guide: Add, remove, or reconfigure Ceph nodes.
Obtain the IDs of the osd and mon services that are located on the old node:
Obtain the UID of the affected machine:
kubectl get machine <CephOSDNodeName> -n <ManagedClusterProjectName> -o jsonpath='{.metadata.annotations.kaas\.mirantis\.com\/uid}'
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Identify the IDs of the pods that run the osd and mon services:
kubectl get pods -o wide -n rook-ceph | grep <affectedMachineUID> | grep -E "mon|osd"
Example of the system response extract:
rook-ceph-mon-c-7bbc5d757d-5bpws                              1/1  Running    1  6h1m
rook-ceph-osd-2-58775d5568-5lklw                              1/1  Running    4  44h
rook-ceph-osd-prepare-705ae6c647cfdac928c63b63e2e2e647-qn4m9  0/1  Completed  0  94s
The pod IDs include the osd or mon service IDs. In the example system response above, the osd ID is 2 and the mon ID is c.
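Reading the service IDs out of the pod names above can be sketched as follows; the osd-prepare pods do not carry a service ID and are skipped (a hypothetical helper, not part of the product):

```python
import re

def service_id(pod_name):
    """Extract (service, ID) from a Rook pod name, for example
    ('mon', 'c') or ('osd', '2'); return None for other pods."""
    m = re.match(r"rook-ceph-(mon|osd)-(?!prepare-)([a-z0-9]+)-", pod_name)
    return (m.group(1), m.group(2)) if m else None
```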
Delete the deployments of the osd and mon services obtained in the previous step:
kubectl delete deployment rook-ceph-osd(mon)-<ID> -n rook-ceph
For example:
kubectl delete deployment rook-ceph-mon-c -n rook-ceph
kubectl delete deployment rook-ceph-osd-2 -n rook-ceph
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Rebalance the Ceph OSDs:
ceph osd out osd.<ID>
Wait for the rebalance to complete.
Rebalance the Ceph data:
ceph osd purge osd.<ID>
Wait for the Ceph data to rebalance.
Remove the old node from the Ceph OSD tree:
ceph osd crush rm <NodeName>
If the removed node contained mon services, remove them:
ceph mon rm <monID>
[7073] Cannot automatically remove a Ceph node¶
When removing a worker node, it is not possible to automatically remove a Ceph node. The workaround is to manually remove the Ceph node from the Ceph cluster as described in Operations Guide: Add, remove, or reconfigure Ceph nodes before removing the worker node from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile
, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff
state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
Log in to the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
Delete the authorization key for the failed Ceph OSD:
ceph auth del osd.<ID>
SSH to the node on which the Ceph OSD cannot be created.
Clean up the disk that will be a base for the failed Ceph OSD. For details, see official Rook documentation.
Note
Ignore failures of the sgdisk --zap-all $DISK and blkdiscard $DISK commands if any.
On the managed cluster, restart Rook Operator:
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
IAM¶
[10829] Keycloak pods fail to start during a management cluster bootstrap¶
Fixed in Container Cloud 2.6.0
The Keycloak pods may fail to start during a management cluster bootstrap with the Failed to update database exception in logs.
Caution
The following workaround is applicable only to deployments
where mariadb-server
has started successfully. Otherwise,
fix the issues with MariaDB first.
Workaround:
Verify that mariadb-server has started:
kubectl get po -n kaas | grep mariadb-server
Scale down the Keycloak instances:
kubectl scale sts iam-keycloak --replicas=0 -n kaas
Open the iam-keycloak-sh configmap for editing:
kubectl edit cm -n kaas iam-keycloak-sh
On the last line of the configmap, before the $MIGRATION_ARGS variable, add the following parameter:
-Djboss.as.management.blocking.timeout=<RequiredValue>
The recommended timeout value is a minimum of 15 minutes set in seconds. For example, -Djboss.as.management.blocking.timeout=900.
Open the iam-keycloak-startup configmap for editing:
kubectl edit cm -n kaas iam-keycloak-startup
In the iam-keycloak-startup configmap, add the following line:
/subsystem=transactions/:write-attribute(name=default-timeout,value=<RequiredValue>)
The recommended timeout value is a minimum of 15 minutes set in seconds.
In the Keycloak StatefulSet, adjust the liveness probe timeouts:
kubectl edit sts -n kaas iam-keycloak
Scale up the Keycloak instances:
kubectl scale sts iam-keycloak --replicas=3 -n kaas
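The first configmap edit above inserts the JBoss blocking-timeout flag in front of $MIGRATION_ARGS. A sketch of that edit applied to an illustrative startup line (the actual configmap contents differ; 900 seconds corresponds to the recommended 15 minutes):

```python
def add_blocking_timeout(startup_line, seconds=900):
    """Insert the JBoss blocking-timeout flag immediately before the
    $MIGRATION_ARGS variable on a startup command line."""
    flag = f"-Djboss.as.management.blocking.timeout={seconds}"
    return startup_line.replace("$MIGRATION_ARGS", f"{flag} $MIGRATION_ARGS", 1)

# Hypothetical startup line, for illustration only:
patched = add_blocking_timeout("exec standalone.sh $MIGRATION_ARGS")
```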
LCM¶
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3. Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment, Helm releases may get stuck in the FAILED or UNKNOWN state although the corresponding machine statuses are Ready in the Container Cloud web UI. For example, if the StackLight Helm release fails, the links to its endpoints are grayed out in the web UI. In the cluster status, providerStatus.helm.ready and providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify that the failed release has the UNKNOWN or FAILED status in the HelmBundle object:
kubectl --kubeconfig <regionalClusterKubeconfigPath> get helmbundle <clusterName> -n <clusterProjectName> -o=jsonpath={.status.releaseStatuses.stacklight}
In the command above and in the steps below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.
Example of system response:
stacklight:
  attempt: 2
  chart: ""
  finishedAt: "2021-02-05T09:41:05Z"
  hash: e314df5061bd238ac5f060effdb55e5b47948a99460c02c2211ba7cb9aadd623
  message: '[{"occurrence":1,"lastOccurrenceDate":"2021-02-05 09:41:05","content":"error updating the release: rpc error: code = Unknown desc = customresourcedefinitions.apiextensions.k8s.io \"helmbundles.lcm.mirantis.com\" already exists"}]'
  notes: ""
  status: UNKNOWN
  success: false
  version: 0.1.2-mcp-398
Log in to the helm-controller pod console:
kubectl --kubeconfig <affectedClusterKubeconfigPath> exec -n kube-system -it helm-controller-0 sh -c tiller
Download the Helm v3 binary. For details, see official Helm documentation.
Remove the failed release:
helm delete <failed-release-name>
For example:
helm delete stacklight
Once done, the release triggers for redeployment.
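The check in the first step can be sketched as a small predicate over the release status fields from the HelmBundle object (a hypothetical helper, not part of the product):

```python
def needs_manual_recovery(release_status):
    """Return True if a Helm release status from the HelmBundle object
    indicates a stuck release that needs manual deletion."""
    return (release_status.get("status") in ("FAILED", "UNKNOWN")
            and not release_status.get("success", False))

# Trimmed-down version of the example system response above:
stuck = {"status": "UNKNOWN", "success": False, "attempt": 2}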
StackLight¶
[11001] Patroni pod fails to start¶
Fixed in Container Cloud 2.6.0
After the management cluster update, a Patroni pod may fail to start and remain
in the CrashLoopBackOff
status. Messages similar to the following ones may
be present in Patroni logs:
Local timeline=4 lsn=0/A000000
master_timeline=6
master: history=1 0/1ADEB48 no recovery target specified
2 0/8044500 no recovery target specified
3 0/A0000A0 no recovery target specified
4 0/A1B6CB0 no recovery target specified
5 0/A2C0C80 no recovery target specified
As a workaround, reinitialize the affected pod with a new volume by deleting
the pod itself and the associated PersistentVolumeClaim
(PVC).
Workaround:
Obtain the PVC of the affected pod:
kubectl -n stacklight get "pod/${POD_NAME}" -o jsonpath='{.spec.volumes[?(@.name=="storage-volume")].persistentVolumeClaim.claimName}'
Delete the affected pod and its PVC:
kubectl -n stacklight delete "pod/${POD_NAME}" "pvc/${POD_PVC}"
sleep 3  # wait for StatefulSet to reschedule the pod, but miss dependent PVC creation
kubectl -n stacklight delete "pod/${POD_NAME}"
Management and regional clusters¶
[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status during a management or managed cluster upgrade. The HelmBundle Controller cannot recover from this state and requires manual actions. The workaround below describes the recovery process for the openstack-operator release that got stuck during a managed cluster update. Use it as an example for other Helm releases as required.
Workaround:
Log in to the helm-controller pod console:
kubectl exec -n kube-system -it helm-controller-0 sh -c tiller
Identify the release that got stuck in the PENDING_UPGRADE status. For example:
./helm --host=localhost:44134 history openstack-operator
Example of system response:
REVISION  UPDATED                   STATUS           CHART                      DESCRIPTION
1         Tue Dec 15 12:30:41 2020  SUPERSEDED       openstack-operator-0.3.9   Install complete
2         Tue Dec 15 12:32:05 2020  SUPERSEDED       openstack-operator-0.3.9   Upgrade complete
3         Tue Dec 15 16:24:47 2020  PENDING_UPGRADE  openstack-operator-0.3.18  Preparing upgrade
Roll back the failed release to the previous revision:
Download the Helm v3 binary. For details, see official Helm documentation.
Roll back the failed release:
helm rollback <failed-release-name> <revision>
For example:
helm rollback openstack-operator 2
Once done, the release will be reconciled.
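Picking the rollback target from the helm history output amounts to taking the last SUPERSEDED revision before the one stuck in PENDING_UPGRADE, which is revision 2 in the example above. A sketch of that selection (a hypothetical helper, not part of the product):

```python
def rollback_target(history_rows):
    """Given (revision, status) rows from `helm history`, return the last
    SUPERSEDED revision before the PENDING_UPGRADE one, or None."""
    target = None
    for revision, status in history_rows:
        if status == "SUPERSEDED":
            target = revision
        elif status == "PENDING_UPGRADE":
            return target
    return None

rows = [(1, "SUPERSEDED"), (2, "SUPERSEDED"), (3, "PENDING_UPGRADE")]
```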
[10424] Regional cluster cleanup fails by timeout¶
An OpenStack-based regional cluster cleanup fails with the timeout error.
Workaround:
Wait for the Cluster object to be deleted in the bootstrap cluster:
kubectl --kubeconfig <(./bin/kind get kubeconfig --name clusterapi) get cluster
The system output must be empty.
Remove the bootstrap cluster manually:
./bin/kind delete cluster --name clusterapi
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display in the Projects list even after refreshing the page. The issue occurs due to the token missing the necessary role for the new project. As a workaround, relogin to the Container Cloud web UI.