Known issues

This section lists MOSK known issues with workarounds for the MOSK release 24.2.2:

OpenStack

[31186,34132] Pods get stuck during MariaDB operations

Due to the upstream MariaDB issue, during MariaDB operations on a management cluster, Pods may get stuck in continuous restarts with the following example error:

[ERROR] WSREP: Corrupt buffer header: \
addr: 0x7faec6f8e518, \
seqno: 3185219421952815104, \
size: 909455917, \
ctx: 0x557094f65038, \
flags: 11577. store: 49, \
type: 49

Workaround:

  1. Create a backup of the /var/lib/mysql directory on the mariadb-server Pod.

  2. Verify that other replicas are up and ready.

  3. Remove the galera.cache file for the affected mariadb-server Pod.

  4. Remove the affected mariadb-server Pod or wait until it is automatically restarted.

After Kubernetes restarts the Pod, the Pod clones the database in 1-2 minutes and restores the quorum.

[43058] [Antelope] Cronjob for MariaDB is not created

Sometimes, after changing the OpenStackDeployment custom resource, it does not transition to the APPLYING state as expected.

To work around the issue, restart the openstack-controller pod in the osh-system namespace.

Tungsten Fabric

[13755] TF pods switch to CrashLoopBackOff after a simultaneous reboot

Rebooting all Cassandra cluster TFConfig or TFAnalytics nodes, maintenance, or other circumstances that cause the Cassandra pods to start simultaneously may cause a broken Cassandra TFConfig and/or TFAnalytics cluster. In this case, Cassandra nodes do not join the ring and do not update the IPs of the neighbor nodes. As a result, the TF services cannot operate Cassandra cluster(s).

To verify that a Cassandra cluster is affected:

Run the nodetool status command specifying the config or analytics cluster and the replica number:

kubectl -n tf exec -it tf-cassandra-<config/analytics>-dc1-rack1-<replica number> -c cassandra -- nodetool status

Example of system response with outdated IP addresses:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
DN  <outdated ip>   ?          256          64.9%             a58343d0-1e3f-4d54-bcdf-9b9b949ca873  r1
DN  <outdated ip>   ?          256          69.8%             67f1d07c-8b13-4482-a2f1-77fa34e90d48  r1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens       Owns (effective)  Host ID                               Rack
UN  <actual ip>      3.84 GiB   256          65.2%             7324ebc4-577a-425f-b3de-96faac95a331  rack1

Workaround:

Manually delete the Cassandra pod from the failed config or analytics cluster to re-initiate the bootstrap process for one of the Cassandra nodes:

kubectl -n tf delete pod tf-cassandra-<config/analytics>-dc1-rack1-<replica_num>

[40032] tf-rabbitmq fails to start after rolling reboot

Occasionally, RabbitMQ instances in tf-rabbitmq pods fail to enable the tracking_records_in_ets during the initialization process.

To work around the problem, restart the affected pods manually.

[46220] ClusterMaintenanceRequest stuck with Tungsten Fabric API v2

Fixed in 24.3

On clusters running Tungsten Fabric with API v2, after updating from MOSK 24.2 to 24.2.1, subsequent cluster maintenance requests may stuck. The root cause of the issue is a version mismatch within the internal structures of the Tungsten Fabric Operator.

To identify if your cluster is affected, run:

kubectl get clusterworkloadlock tf-openstack-tf -o yaml

The output similar to the one below, indicates that the Tungsten Fabric ClusterWorkloadLock remains in the active state indefinitely preventing further LCM operations with other components:

apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterWorkloadLock
metadata:
  creationTimestamp: "2024-08-30T13:50:33Z"
  generation: 1
  name: tf-openstack-tf
  resourceVersion: "4414649"
  uid: 582fc558-c343-4e96-a445-a2d1818dcdb2
spec:
  controllerName: tungstenfabric
status:
  errorMessage: cluster is not in ready state
  release: 17.2.4+24.2.2
  state: active

Additionally, the LCM controller logs may contain errors similar to:

{"level":"info","ts":"2024-09-02T16:22:16Z","logger":"entrypoint.lcmcluster-controller.req:5520","caller":"lcmcluster/maintenance.go:178","msg":"ClusterWorkloadLock is inactive cwl {{ClusterWorkloadLock lcm.mirantis.com/v1alpha1} {ceph-clusterworkloadlock    a45eca91-cd7b-4d68-9a8e-4d656b4308af 3383288 1 2024-08-30 13:15:14 +0000 UTC <nil> <nil> map[] map[miraceph-ready:true] [{v1 Namespace ceph-lcm-mirantis 43853f67-9058-44ed-8287-f650dbeac5d7 <nil> <nil>}]
[] [{ceph-controller Update lcm.mirantis.com/v1alpha1 2024-08-30 13:25:53 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:miraceph-ready\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"43853f67-9058-44ed-8287-f650dbeac5d7\\\"}\":{}}},\"f:spec\":{\".\":{},\"f:controllerName\":{}}} } {ceph-controller Update lcm.mirantis.com/v1alpha1 2024-09-02 10:48:27 +0000 UTC FieldsV1 {\"f:status\":{\".\":{},\"f:release\":{},\"f:state\":{}}} status}]} {ceph} {inactive  17.2.4+24.2.2}}","ns":"child-ns-tf","name":"child-cl"}
{"level":"info","ts":"2024-09-02T16:22:16Z","logger":"entrypoint.lcmcluster-controller.req:5520","caller":"lcmcluster/maintenance.go:178","msg":"ClusterWorkloadLock is inactive cwl {{ClusterWorkloadLock lcm.mirantis.com/v1alpha1} {openstack-osh-dev    7de2b86f-d247-4cee-be8d-dcbcf5e1e11b 3382535 1 2024-08-30 13:50:54 +0000 UTC <nil> <nil> map[] map[] [] [] [{pykube-ng Update lcm.mirantis.com/v1alpha1 2024-08-30 13:50:54 +0000 UTC FieldsV1 {\"f:spec\":{\".\":{},\"f:controllerName\":{}}} } {pykube-ng Update lcm.mirantis.com/v1alpha1 2024-09-02 10:47:29 +0000 UTC FieldsV1 {\"f:status\":{\".\":{},\"f:release\":{},\"f:state\":{}}} status}]} {openstack} {inactive  17.2.4+24.2.2}}","ns":"child-ns-tf","name":"child-cl"}
{"level":"info","ts":"2024-09-02T16:22:16Z","logger":"entrypoint.lcmcluster-controller.req:5520","caller":"lcmcluster/maintenance.go:173","msg":"ClusterWorkloadLock is still active cwl {{ClusterWorkloadLock lcm.mirantis.com/v1alpha1} {tf-openstack-tf    582fc558-c343-4e96-a445-a2d1818dcdb2 3382495 1 2024-08-30 13:50:33 +0000 UTC <nil> <nil> map[] map[] [] [] [{maintenance-ctl Update lcm.mirantis.com/v1alpha1 2024-08-30 13:50:33 +0000 UTC FieldsV1 {\"f:spec\":{\".\":{},\"f:controllerName\":{}}} } {maintenance-ctl Update lcm.mirantis.com/v1alpha1 2024-09-02 10:47:25 +0000 UTC FieldsV1 {\"f:status\":{\".\":{},\"f:errorMessage\":{},\"f:release\":{},\"f:state\":{}}} status}]} {tungstenfabric} {active cluster is not in ready state 17.2.4+24.2.2}}","ns":"child-ns-tf","name":"child-cl"}
{"level":"error","ts":"2024-09-02T16:22:16Z","logger":"entrypoint.lcmcluster-controller.req:5520","caller":"lcmcluster/lcmcluster_controller.go:388","msg":"","ns":"child-ns-tf","name":"child-cl","error":"following ClusterWorkloadLocks in cluster child-ns-tf/child-cl are still active -  tf-openstack-tf: InProgress not all ClusterWorkloadLocks are inactive yet","stacktrace":"sigs.k8s.io/cluster-api-provider-openstack/pkg/lcm/controller/lcmcluster.(*ReconcileLCMCluster).updateCluster\n\t/go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/lcm/controller/lcmcluster/lcmcluster_controller.go:388\nsigs.k8s.io/cluster-api-provider-openstack/pkg/lcm/controller/lcmcluster.(*ReconcileLCMCluster).Reconcile\n\t/go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/lcm/controller/lcmcluster/lcmcluster_controller.go:223\nsigs.k8s.io/cluster-api-provider-openstack/pkg/service.(*reconcilePanicCatcher).Reconcile\n\t/go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/service/reconcile.go:98\nsigs.k8s.io/cluster-api-provider-openstack/pkg/service.(*reconcileContextEnricher).Reconcile\n\t/go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/service/reconcile.go:78\nsigs.k8s.io/cluster-api-provider-openstack/pkg/service.(*reconcileMetrics).Reconcile\n\t/go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/service/reconcile.go:136\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.3/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.3/pkg/internal/controller/controller.go:31:

To work around the issue, set the actual version of Tungsten Fabric Operator in the TFOperator custom resource:

  • For MOSK 24.2.1:

    kubectl -n tf patch tfoperators.tf.mirantis.com openstack-tf --type=merge --subresource status --patch 'status: {operatorVersion: <0.15.5>}'
    
  • For MOSK 24.2.2:

    kubectl -n tf patch tfoperators.tf.mirantis.com openstack-tf --type=merge --subresource status --patch 'status: {operatorVersion: <0.15.6>}'
    

Update known issues

[42449] Rolling reboot failure on a Tungsten Fabric cluster

During cluster update, the rolling reboot fails on the Tungsten Fabric cluster. To work around the issue, restart the RabbitMQ pods in the Tungsten Fabric cluster.

[46671] Cluster update fails with the tf-config pods crashed

When updating to the MOSK 24.3 series, tf-config pods from the Tungsten Fabric namespace may enter the CrashLoopBackOff state. For example:

tf-config-cs8zr                            2/5     CrashLoopBackOff   676 (19s ago)   15h
tf-config-db-6zxgg                         1/1     Running            44 (25m ago)    15h
tf-config-db-7k5sz                         1/1     Running            43 (23m ago)    15h
tf-config-db-dlwdv                         1/1     Running            43 (25m ago)    15h
tf-config-nw4tr                            3/5     CrashLoopBackOff   665 (43s ago)   15h
tf-config-wzf6c                            1/5     CrashLoopBackOff   680 (10s ago)   15h
tf-control-c6bnn                           3/4     Running            41 (23m ago)    13h
tf-control-gsnnp                           3/4     Running            42 (23m ago)    13h
tf-control-sj6fd                           3/4     Running            41 (23m ago)    13h

To troubleshoot the issue, check the logs inside the tf-config API container and the tf-cassandra pods. The following example logs indicate that Cassandra services failed to peer with each other and are operating independently:

  • Logs from the tf-config API container:

    NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 192.168.200.23:9042 dc1>: Unavailable('Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level QUORUM" info={\'required_replicas\': 2, \'alive_replicas\': 1, \'consistency\': \'QUORUM\'}',)})
    
  • Logs from the tf-cassandra pods:

    INFO  [OptionalTasks:1] 2024-09-09 08:59:36,231 CassandraRoleManager.java:419 - Setup task failed with error, rescheduling
    WARN  [OptionalTasks:1] 2024-09-09 08:59:46,231 CassandraRoleManager.java:379 - CassandraRoleManager skipped default role setup: some nodes were not ready
    

To work around the issue, restart the Cassandra services in the Tungsten Fabric namespace by deleting the affected pods sequentially to establish the connection between them:

kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-0
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-1
kubectl -n tf delete pod tf-cassandra-config-dc1-rack1-2

Now, all other services in the Tungsten Fabric namespace should be in the Active state.