Replace a failed controller node
This section describes how to replace a failed control plane node in your MOSK deployment. The procedure applies to control plane nodes that have permanently failed, for example, due to a hardware failure, and appear in the NotReady state:
kubectl get nodes <CONTAINER-CLOUD-NODE-NAME>
Example of system response:
NAME                          STATUS     ROLES    AGE   VERSION
<CONTAINER-CLOUD-NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1
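If you do not know the node name in advance, you can list all nodes and filter for the unhealthy ones, for example:
kubectl get nodes | grep NotReady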
To replace a failed controller node:
Remove the Kubernetes labels from the failed node by editing the .metadata.labels section of the node object:
kubectl edit node <CONTAINER-CLOUD-NODE-NAME>
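Alternatively, you can remove labels non-interactively by appending a dash to the label key with kubectl label. The <LABEL-KEY> placeholder below stands for whichever labels are set on your failed node:
kubectl label node <CONTAINER-CLOUD-NODE-NAME> <LABEL-KEY>-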
If your cluster is deployed with a compact control plane, review the precautions for cluster machine deletion.
Add the control plane node to your deployment as described in Add a controller node.
Identify all stateful applications present on the failed node:
node=<CONTAINER-CLOUD-NODE-NAME>
claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
for i in $claims; do echo $i; done
Example of system response:
mysql-data-mariadb-server-2
openstack-operator-bind-mounts-rfr-openstack-redis-1
etcd-data-etcd-etcd-0
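To double-check where a particular claim is pinned, you can trace it to its persistent volume and inspect the node affinity. A minimal sketch, assuming one of the claim names from the output above:
pv=$(kubectl -n openstack get pvc <CLAIM-NAME> -o jsonpath='{.spec.volumeName}')
kubectl get pv $pv -o jsonpath='{.spec.nodeAffinity}'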
For the MOSK 23.3 series or earlier, reschedule stateful application pods to healthy controller nodes as described in Reschedule stateful applications. For newer versions, MOSK reschedules stateful applications automatically.
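To see which pods are still bound to the failed node while rescheduling is in progress, you can filter by node name. A minimal check using the standard kubectl field selector:
kubectl -n openstack get pods -o wide --field-selector spec.nodeName=<CONTAINER-CLOUD-NODE-NAME>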
If the failed controller node had the StackLight label, fix the StackLight volume node affinity conflict as described in Delete a cluster machine.
Remove the OpenStack port related to the Octavia health manager pod of the failed node:
kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
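If you do not know the name of the Keystone client pod, one way to locate it is to filter the pod list; this sketch assumes the pod name contains keystone-client:
kubectl -n openstack get pods | grep keystone-client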
For clouds using Open Virtual Network (OVN) as the networking backend, remove the Northbound and Southbound database members for the failed node:
Log in to the running openvswitch-ovn-db-XX pod.
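For example, using kubectl exec (substitute the index of a healthy replica; depending on your deployment, you may also need to select a container with -c):
kubectl -n openstack exec -it openvswitch-ovn-db-XX -- bash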
Remove an old Northbound database member:
Identify the member to be removed:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
Example of system response:
5d02
Name: OVN_Northbound
Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
Status: cluster member
Role: follower
Term: 5402
Leader: c617
Vote: c617

Election timer: 10000
Log: [22917, 26535]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->c617 ->4d1e <-c617 <-0e28
Disconnections: 0
Servers:
    c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 1153 ms ago
    4d1e (4d1e at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643)
    0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 109828 ms ago
    5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
In the above example output, the 4d1e member belongs to the failed node.
Remove the old member:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 4d1e
Example of system response:
sent removal request to leader
Verify that the old member has been removed successfully:
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
Example of a successful system response:
5d02
Name: OVN_Northbound
Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
Status: cluster member
Role: follower
Term: 5402
Leader: c617
Vote: c617

Election timer: 10000
Log: [22917, 26536]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->c617 <-c617 <-0e28 ->0e28
Disconnections: 1
Servers:
    c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 3321 ms ago
    0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 134877 ms ago
    5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
Remove an old Southbound database member by following the same steps used to remove an old Northbound database member:
Identify the member to be removed:
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
Remove the old member:
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <SERVER-ID>
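As with the Northbound database, verify that the old member has been removed successfully:
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound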