Known issues¶
This section lists MOSK known issues with workarounds for the MOSK release 25.1.1. For the known issues in the related Container Cloud release, refer to Mirantis Container Cloud: Release Notes.
Update known issues¶
[42449] Rolling reboot failure on a Tungsten Fabric cluster¶
During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.
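For example, you can restart the RabbitMQ pods by deleting them so that Kubernetes recreates them. The tf namespace and the pod name pattern below are assumptions; verify the actual names in your cluster first:
kubectl -n tf get pods | grep rabbitmq
kubectl -n tf delete pod <TF-RABBITMQ-POD-NAME>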
[49705] Cluster update is stuck due to unhealthy tf-vrouter-agent-dpdk pods¶
During a MOSK cluster update, the tf-vrouter-agent-dpdk pods may become unhealthy due to a failed LivenessProbe, causing the update process to get stuck. The issue may only affect major updates when the cluster dataplane components are restarted.
To work around the issue, manually remove the tf-vrouter-agent-dpdk pods.
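For example, the pods can be removed as follows so that their DaemonSet recreates them. The tf namespace is an assumption; verify the actual namespace and pod names in your cluster first:
kubectl -n tf get pods | grep tf-vrouter-agent-dpdk
kubectl -n tf delete pod <TF-VROUTER-AGENT-DPDK-POD-NAME>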
OpenStack¶
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:
[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49
Workaround:
1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.
2. Verify that other replicas are up and ready.
3. Remove the galera.cache file for the affected mariadb-server Pod.
4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.
After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.
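The following is a minimal sketch of the workaround commands. The namespace, Pod, and container names are placeholders and must be adjusted to your environment:
kubectl -n <NAMESPACE> cp <MARIADB-SERVER-POD>:/var/lib/mysql ./mysql-backup -c <CONTAINER>
kubectl -n <NAMESPACE> get pods | grep mariadb-server   # verify that other replicas are up and ready
kubectl -n <NAMESPACE> exec <MARIADB-SERVER-POD> -c <CONTAINER> -- rm /var/lib/mysql/galera.cache
kubectl -n <NAMESPACE> delete pod <MARIADB-SERVER-POD>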
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue, a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have the same externalTrafficPolicy value. Initially, the services have the external IP address assigned and are accessible. After modifying the externalTrafficPolicy value for both services from Cluster to Local, the first service that has been changed remains with no external IP address assigned. However, the second service, which was changed later, has the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where the external IP is <pending>:
Identify the service that is stuck:
kubectl get svc -A | grep pending
Example of system response:
stacklight iam-proxy-prometheus LoadBalancer 10.233.28.196 <pending> 443:30430/TCP
Add an arbitrary label to the service that is stuck. For example:
kubectl label svc -n stacklight iam-proxy-prometheus reconcile=1
Example of system response:
service/iam-proxy-prometheus labeled
Verify that the external IP was allocated to the service:
kubectl get svc -n stacklight iam-proxy-prometheus
Example of system response:
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
iam-proxy-prometheus   LoadBalancer   10.233.28.196   10.0.34.108   443:30430/TCP   12d
[53401] Credential rotation reports success without performing action¶
Occasionally, the password rotation procedure for admin or service credentials may incorrectly report success without actually initiating the rotation process. This can result in unchanged credentials despite the procedure indicating completion.
To work around the issue, restart the rotation procedure and verify that the credentials have been successfully updated.
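For example, one way to verify that the admin credentials were actually rotated is to issue a token with the refreshed credentials. The keystone-client deployment name and the openstack namespace below are assumptions; adjust them to your environment:
kubectl -n openstack exec -it deployment/keystone-client -- openstack token issue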
Tungsten Fabric¶
[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot¶
Rebooting all nodes of the TFConfig or TFAnalytics Cassandra cluster at once, during maintenance or under other circumstances that cause the Cassandra pods to start simultaneously, may result in a broken Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate the Cassandra cluster(s).
To verify that a Cassandra cluster is affected:
Run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status
Example of system response with outdated IP addresses:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN <outdated ip> ? 256 64.9% a58343d0-1e3f-4d54-bcdf-9b9b949ca873 r1
DN <outdated ip> ? 256 69.8% 67f1d07c-8b13-4482-a2f1-77fa34e90d48 r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN <actual ip> 3.84 GiB 256 65.2% 7324ebc4-577a-425f-b3de-96faac95a331 rack1
Workaround:
Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:
kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>
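After the deleted pod is recreated, you can verify that the Cassandra node has rejoined the ring with its actual IP address by re-running nodetool status:
kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica_num> -c cassandra -- nodetool status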
[40032] tf-rabbitmq fails to start after rolling reboot¶
Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable tracking_records_in_ets during the initialization process.
To work around the problem, restart the affected pods manually.
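For example, the affected pods can be restarted by deleting them so that they are recreated. The tf namespace is an assumption; verify the actual pod names in your cluster first:
kubectl -n tf get pods | grep tf-rabbitmq
kubectl -n tf delete pod <TF-RABBITMQ-POD-NAME>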
[42896] Cassandra cluster contains extra node with outdated IP after replacement of TF control node¶
After replacing a failed Tungsten Fabric controller node as described in Replace a failed TF controller node, the first restart of the Cassandra pod on this node may cause an issue if the Cassandra node with the outdated IP address has not been removed from the cluster. Subsequent Cassandra pod restarts should not trigger this problem.
To verify if your Cassandra cluster is affected, run the nodetool status command specifying the config or analytics cluster and the replica number:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status
Example of the system response with outdated IP addresses:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.201.144 509.43 KiB 256 ? 7e760a99-fae5-4921-b0c5-d9e6e1eca1c5 rack1
UN 192.168.50.146 534.18 KiB 256 ? 2248ea35-85d4-4887-820b-1fac4733021f rack1
UN 192.168.145.147 484.19 KiB 256 ? d988aaaa-44ae-4fec-a617-0b0a253e736d rack1
DN 192.168.145.144 481.53 KiB 256 ? c23703a1-6854-47a7-a4a2-af649d63af0c rack1
An extra node will appear in the cluster with an outdated IP address (the IP of the terminated Cassandra pod) in the Down state.
To work around the issue, after replacing the Tungsten Fabric controller node, delete the Cassandra pod on the replaced node and remove the outdated node from the Cassandra cluster using nodetool:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool removenode <HOST-ID>
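To verify that the node with the outdated IP address is no longer part of the cluster, re-run nodetool status:
kubectl -n tf exec -it tf-cassandra-<CONFIG-OR-ANALYTICS>-dc1-rack1-<REPLICA-NUM> -c cassandra -- nodetool status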
StackLight¶
[42463] KubePodsCrashLooping is firing during cluster update¶
During a major or patch update of a MOSK cluster with StackLight enabled in non-HA mode, the KubePodsCrashLooping alert may fire for the Grafana ReplicaSet.
Grafana relies on PostgreSQL for persistent data. In a non-HA StackLight setup, PostgreSQL becomes temporarily unavailable during updates. If Grafana loses its database connection or fails to establish one during startup, Grafana fails with an error. This may cause the Grafana pod to enter the CrashLoopBackOff state. Such behavior is expected in non-HA StackLight setups, and the Grafana pod resumes normal operation after PostgreSQL is restored.
To prevent the issue, deploy StackLight in HA mode.
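As a sketch, StackLight HA mode is controlled through the StackLight Helm release values in the Cluster object. The highAvailabilityEnabled parameter below is an assumption; verify it against the StackLight configuration reference for your release:
kubectl -n <PROJECT-NAME> edit cluster <CLUSTER-NAME>
# In spec.providerSpec.value.helmReleases, set highAvailabilityEnabled to true
# in the values of the stacklight release (parameter name is an assumption)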
Container Cloud web UI¶
[53886] Moving a worker machine into maintenance mode fails¶
After moving a cluster into maintenance mode using the MOSK management console, an attempt to move a worker machine into maintenance mode fails with the following error: Only undefined workers can be in maintainance mode at the same time.
As a workaround, move worker machines into maintenance mode using the CLI as described in Enable maintenance mode on a cluster and machine using CLI.
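A minimal sketch of the CLI-based workaround, assuming that maintenance mode is toggled through the maintenance field of the Cluster and Machine objects (verify the exact field against the referenced procedure):
kubectl -n <PROJECT-NAME> edit cluster <CLUSTER-NAME>
# set spec.providerSpec.value.maintenance to true and wait for the cluster to enter maintenance mode
kubectl -n <PROJECT-NAME> edit machine <WORKER-MACHINE-NAME>
# set spec.providerSpec.value.maintenance to true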
[50168] Inability to use a new project through the Container Cloud web UI¶
A newly created project does not display all available tabs and shows various access denied errors during the first five minutes after creation.
To work around the issue, refresh the browser five minutes after the project creation.
[50181] Failure to deploy a compact cluster using the Container Cloud web UI¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI because it is impossible to add any label to the control plane machines or to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once done, the cluster deployment resumes.
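For example, the labels can be added by editing the Machine objects directly. The nodeLabels field below is an assumption based on the typical Machine specification; verify the exact structure for your release:
kubectl -n <PROJECT-NAME> edit machine <CONTROL-PLANE-MACHINE-NAME>
# add the required labels under spec.providerSpec.value.nodeLabels, for example:
#   nodeLabels:
#   - key: <LABEL-KEY>
#     value: <LABEL-VALUE>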