Replace a failed controller node

This section describes how to replace a failed control plane node in your MOSK deployment. The procedure applies to control plane nodes that have permanently failed, for example, due to a hardware failure, and appear in the NotReady state:

kubectl get nodes <NODE-NAME>

Example of system response:

NAME           STATUS     ROLES    AGE   VERSION
<NODE-NAME>    NotReady   <none>   10d   v1.18.8-mirantis-1

To replace a failed controller node:

  1. Remove the Kubernetes labels from the failed node by editing the .metadata.labels section of the node object:

    kubectl edit node <NODE-NAME>
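
    Alternatively, you can remove labels non-interactively with kubectl label. The following is a sketch; the label key is an example, so substitute the labels actually present on your node:

    # Show the labels currently set on the node
    kubectl get node <NODE-NAME> --show-labels
    # A trailing dash removes the given label; the key here is an example
    kubectl label node <NODE-NAME> openstack-control-plane-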
    
  2. If your cluster is deployed with a compact control plane, review the precautions for cluster machine deletion.

  3. Add the control plane node to your deployment as described in Add a controller node.

  4. Identify all stateful applications present on the failed node:

    node=<NODE-NAME>
    # List the claims of the PersistentVolumes whose node affinity pins them to the failed node
    claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
    for i in $claims; do echo $i; done
    

    Example of system response:

    mysql-data-mariadb-server-2
    openstack-operator-bind-mounts-rfr-openstack-redis-1
    etcd-data-etcd-etcd-0
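
    Optionally, to double-check the mapping, print each claim together with its bound volume. A minimal sketch using standard kubectl calls, run in the same shell session as the snippet above:

    # For each claim, show the bound PersistentVolume
    for claim in $claims; do
      pv=$(kubectl -n openstack get pvc "$claim" -o jsonpath='{.spec.volumeName}')
      echo "$claim -> $pv"
    done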
    
  5. If the failed controller node had the StackLight label, fix the StackLight volume node affinity conflict as described in Delete a cluster machine.
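
    A quick way to spot the affected pods is to list the ones stuck in the Pending state; the stacklight namespace is an assumption based on a default StackLight deployment:

    kubectl -n stacklight get pods --field-selector status.phase=Pending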

  6. Remove the OpenStack port related to the Octavia health manager pod of the failed node:

    kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
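
    If you do not know the Keystone client pod name, you can look it up first. The label selector below is an assumption; adjust it to match your deployment:

    kubectl -n openstack get pods -l application=keystone,component=client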
    
  7. Skip this step since MOSK 25.2.1; it applies to MOSK 25.2 or earlier. For clouds using Open Virtual Network (OVN) as the networking backend, remove the Northbound and Southbound database members for the failed node:

    1. Log in to the running openvswitch-ovn-db-XX pod.
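
      For example, assuming the OVN DB pods run in the openstack namespace (the pod name is a placeholder; pick a running replica):

      kubectl -n openstack exec -it openvswitch-ovn-db-0 -- sh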

    2. Remove an old Northbound database member:

      1. Identify the member to be removed:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
        

        Example of system response:

        5d02
        Name: OVN_Northbound
        Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
        Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
        Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
        Status: cluster member
        Role: follower
        Term: 5402
        Leader: c617
        Vote: c617
        
        Election timer: 10000
        Log: [22917, 26535]
        Entries not yet committed: 0
        Entries not yet applied: 0
        Connections: ->c617 ->4d1e <-c617 <-0e28
        Disconnections: 0
        Servers:
            c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 1153 ms ago
            4d1e (4d1e at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643)
            0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 109828 ms ago
            5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
        

        In the example output above, the 4d1e member belongs to the failed node: unlike the active members, it shows no last msg timestamp.

      2. Remove the old member:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 4d1e
        

        Example of system response:

        sent removal request to leader
        
      3. Verify that the old member has been removed successfully:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
        

        Example of a successful system response:

        5d02
        Name: OVN_Northbound
        Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
        Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
        Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
        Status: cluster member
        Role: follower
        Term: 5402
        Leader: c617
        Vote: c617
        
        Election timer: 10000
        Log: [22917, 26536]
        Entries not yet committed: 0
        Entries not yet applied: 0
        Connections: ->c617 <-c617 <-0e28 ->0e28
        Disconnections: 1
        Servers:
            c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 3321 ms ago
            0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 134877 ms ago
            5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
        
    3. Remove an old Southbound database member by following the same steps used to remove an old Northbound database member:

      1. Identify the member to be removed:

        ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
        
      2. Remove the old member:

        ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <SERVER-ID>
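
      3. Verify that the old member has been removed successfully:

        ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound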
        
  8. Strongly recommended. Back up MKE as described in Mirantis Kubernetes Engine documentation: Back up MKE.

    Since the procedure above modifies the cluster configuration, create a fresh backup so that you can restore the cluster if further reconfiguration fails.

    Important

    Because the MKE restoration process is complicated, we strongly recommend contacting Mirantis support for assistance.

    If you still decide to restore MKE from a backup on your own, and the MKE version of the affected cluster after the restore will differ from the MKE version in the ClusterRelease object set in the MOSK Cluster objects in the management cluster, you must scale down helm-controller on the cluster being restored:

    • If you are restoring MKE on a management cluster: before starting the restore, scale down helm-controller on each affected MOSK cluster. This prevents unintended Ceph and OpenStack downgrades on MOSK clusters after the management cluster is restored.

    • If you are restoring MKE on a MOSK cluster: immediately after the restore completes, scale down helm-controller. Because the restore rolls the cluster back to an older release, scaling down helm-controller prevents a premature upgrade of Helm releases.
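
    A minimal sketch of scaling down helm-controller, assuming it runs as a Deployment; the namespace and object name are assumptions, so verify them in your cluster first:

    # Locate the helm-controller workload (namespace and name may differ)
    kubectl get deployments -A | grep helm-controller
    # Scale it down; substitute the namespace found above
    kubectl -n <NAMESPACE> scale deployment helm-controller --replicas=0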