Replace a failed controller node¶
This section describes how to replace a failed control plane node in your
MOSK deployment. The procedure applies to control plane nodes that have
permanently failed, for example due to a hardware failure, and appear in the
NotReady state:
kubectl get nodes <NODE-NAME>
Example of system response:
NAME          STATUS     ROLES    AGE   VERSION
<NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1
To replace a failed controller node:
Remove the Kubernetes labels from the failed node by editing the
.metadata.labels section of the node object:
kubectl edit node <NODE-NAME>
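As an alternative to interactive editing, labels can be removed with the kubectl label command by appending a dash to each label key. This is a minimal sketch; the label keys shown are examples only, so list the labels actually present on your node first and substitute them accordingly:

```shell
# Inspect the labels currently set on the failed node.
kubectl get node <NODE-NAME> --show-labels

# Remove a label by appending "-" to its key.
# The keys below are placeholders; use the labels found above.
kubectl label node <NODE-NAME> openstack-control-plane- openstack-gateway-
```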
If your cluster is deployed with a compact control plane, review the precautions for a cluster machine deletion.
Add the control plane node to your deployment as described in Add a controller node.
Identify all stateful applications present on the failed node:
node=<NODE-NAME>
claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
for i in $claims; do echo $i; done
Example of system response:
mysql-data-mariadb-server-2
openstack-operator-bind-mounts-rfr-openstack-redis-1
etcd-data-etcd-etcd-0
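To double-check the result, you can resolve each claim back to its PersistentVolume and print the node it is pinned to. This is a sketch that assumes the same namespace and node-affinity layout as the query above; every value it prints should equal the failed node name:

```shell
# For each claim found above, show which node its PV is pinned to.
for claim in $claims; do
  pv=$(kubectl -n openstack get pvc "$claim" -o jsonpath='{.spec.volumeName}')
  kubectl get pv "$pv" \
    -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}'
done
```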
If the failed controller node had the StackLight label, fix the StackLight
volume node affinity conflict as described in Delete a cluster machine.
Remove the OpenStack port related to the Octavia health manager pod of the failed node:
kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
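If you do not know the Keystone client pod name, it can usually be discovered with a label selector before running the port deletion. The selector below is an assumption based on common openstack-helm labeling conventions; verify it matches your deployment before relying on it:

```shell
# Locate the Keystone client pod (label selector is an assumption;
# adjust it to the labels used in your deployment).
KEYSTONE_POD=$(kubectl -n openstack get pods \
  -l application=keystone,component=client \
  -o jsonpath='{.items[0].metadata.name}')

# Delete the Octavia health manager port for the failed node.
kubectl -n openstack exec -t "$KEYSTONE_POD" -- \
  openstack port delete "octavia-health-manager-listen-port-<NODE-NAME>"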
Strongly recommended. Back up MKE as described in Create backups of Mirantis Kubernetes Engine.
Since the procedure above modifies the cluster configuration, a fresh backup is required to restore the cluster in case further reconfigurations fail.