Replace a failed controller node
This section describes how to replace a failed control plane node in your
MOSK deployment. The procedure applies to control plane nodes that have
permanently failed, for example due to a hardware failure, and appear in the
NotReady state:
kubectl get nodes <NODE-NAME>
Example of system response:
NAME          STATUS     ROLES    AGE   VERSION
<NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1
To replace a failed controller node:
Remove the Kubernetes labels from the failed node by editing the
.metadata.labels section of the node object:
kubectl edit node <NODE-NAME>
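As a non-interactive alternative to kubectl edit, a label can be removed with kubectl label by appending a dash to its key. The sketch below is illustrative only: the label key example.com/role is a placeholder, not an actual MOSK label, so list the node's labels first to identify the ones to remove.

```shell
# Show the labels currently set on the failed node:
kubectl get node <NODE-NAME> --show-labels

# Remove a label by appending "-" to its key
# ("example.com/role" is a placeholder, not a real MOSK label):
kubectl label node <NODE-NAME> example.com/role-
```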
If your cluster is deployed with a compact control plane, inspect precautions for a cluster machine deletion.
Add the control plane node to your deployment as described in Add a controller node.
Identify all stateful applications present on the failed node:
node=<NODE-NAME>
claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
for i in $claims; do echo $i; done
Example of system response:
mysql-data-mariadb-server-2
openstack-operator-bind-mounts-rfr-openstack-redis-1
etcd-data-etcd-etcd-0
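The jsonpath query returns the claim names as a single space-separated string, which the for loop then prints one per line. The sketch below shows that post-processing in isolation, using the example output above as mock data in place of a live kubectl query:

```shell
# Mock data standing in for the kubectl jsonpath output above:
claims="mysql-data-mariadb-server-2 openstack-operator-bind-mounts-rfr-openstack-redis-1 etcd-data-etcd-etcd-0"

# Print each persistent volume claim on its own line:
for claim in $claims; do
  echo "$claim"
done
```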
If the failed controller node had the StackLight label, fix the StackLight
volume node affinity conflict as described in Delete a cluster machine.
Remove the OpenStack port related to the Octavia health manager pod of the
failed node:
kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
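To confirm the deletion, you can list ports by name from the same client pod; empty output means the port is gone. This is a sketch using the standard openstack CLI --name filter, not a step from the official procedure:

```shell
# Verify that the Octavia health manager port no longer exists;
# empty output is the expected result:
kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- \
  openstack port list --name octavia-health-manager-listen-port-<NODE-NAME>
```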
Strongly recommended. Back up MKE as described in Create backups of Mirantis Kubernetes Engine.
Because the procedure above modifies the cluster configuration, a fresh backup is required to restore the cluster in case further reconfiguration fails.
Important
Because the MKE restoration process is complicated, we strongly recommend contacting Mirantis support for assistance.
If you still decide to restore MKE from a backup on your own, you must scale
down helm-controller on the cluster being restored if the MKE version of the
affected cluster after the restore differs from the MKE version in the
ClusterRelease object set in the MOSK Cluster objects in the management
cluster:
If you are restoring MKE on a management cluster: before starting the restore,
scale down helm-controller on each affected MOSK cluster. This prevents
unintended Ceph and OpenStack downgrades on the MOSK clusters after the
management cluster is restored.
If you are restoring MKE on a MOSK cluster: immediately after the restore
completes, scale down helm-controller. Because the restore rolls the cluster
back to an older release, this prevents it from triggering a premature upgrade
of Helm releases.
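Scaling helm-controller down and back up can be sketched as follows. The namespace and resource kind are assumptions, so locate where helm-controller actually runs in your cluster before applying these commands:

```shell
# Locate the helm-controller workload
# (namespace "kube-system" below is an assumption, not a documented value):
kubectl get deployments --all-namespaces | grep helm-controller

# Pause Helm release reconciliation by scaling down to zero replicas:
kubectl -n kube-system scale deployment helm-controller --replicas=0

# Once it is safe to resume reconciliation, scale back up:
kubectl -n kube-system scale deployment helm-controller --replicas=1
```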