Replace a failed manager node
This section describes how to replace a failed manager node in both
MOSK management clusters and MOSK clusters. The
procedure applies to manager nodes that have permanently failed, for
example, due to a hardware failure, and remain in the NotReady state.
Caution
If your MOSK cluster is deployed with a compact control plane, follow the Replace a failed controller node procedure.
To replace a failed manager node:
Verify that the affected manager node is in the NotReady state:

kubectl get nodes <NODE-NAME>

Example of system response:

NAME          STATUS     ROLES    AGE   VERSION
<NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1
Delete the affected manager node as described in Delete a cluster machine.
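The referenced procedure is the authoritative source for machine deletion. As an illustration only, on clusters managed through the Kubernetes API, removing a machine typically means deleting its Machine object in the management cluster. The namespace and object names below are hypothetical placeholders, not values from this document:

```shell
# Run against the management cluster.
# <project-namespace> and <MACHINE-NAME> are placeholders; look them up first.
kubectl get machines --all-namespaces

# Delete the Machine object that corresponds to the failed manager node:
kubectl -n <project-namespace> delete machine <MACHINE-NAME>
```

Verify that the machine and its node are removed before proceeding to the next step.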
Add a manager node as described in Add a machine.
Strongly recommended. Back up MKE as described in Mirantis Kubernetes Engine documentation: Back up MKE.
Since the procedure above modifies the cluster configuration, a fresh backup is required to restore the cluster in case further reconfigurations fail.
Important
Because the MKE restoration process is complicated, we strongly recommend contacting Mirantis support for assistance.
If you still decide to restore MKE from a backup on your own, you must scale down
helm-controller on the cluster being restored if the MKE version of the affected cluster after the restore differs from the MKE version in the ClusterRelease object that is set in the MOSK Cluster objects in the management cluster:

If you are restoring MKE on a management cluster: before starting the restore, scale down
helm-controller on each affected MOSK cluster. This prevents unintended Ceph and OpenStack downgrades on MOSK clusters after the management cluster is restored.

If you are restoring MKE on a MOSK cluster: immediately after the restore completes, scale down
helm-controller. Because the restore rolls the cluster back to an older release, this prevents it from triggering a premature upgrade of Helm releases.
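The scale-down described above can be sketched with kubectl as follows. The deployment name and namespace are assumptions and may differ in your environment; confirm them before running the commands:

```shell
# Locate the helm-controller workload (namespace assumed, verify first):
kubectl get deployments --all-namespaces | grep helm-controller

# Scale it down to stop Helm release reconciliation during/after the restore:
kubectl -n <helm-controller-namespace> scale deployment helm-controller --replicas=0

# Once the MKE versions are reconciled, scale it back up:
kubectl -n <helm-controller-namespace> scale deployment helm-controller --replicas=1
```

Remember to restore the original replica count once the cluster is back on the expected release; leaving helm-controller scaled down blocks all further Helm release updates.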