Replace a failed controller node

This section describes how to replace a failed control plane node in your MOSK deployment. The procedure applies to the control plane nodes that are, for example, permanently failed due to a hardware failure and appear in the NotReady state:

kubectl get nodes <NODE-NAME>

Example of system response:

NAME                         STATUS       ROLES    AGE   VERSION
<NODE-NAME>    NotReady   <none>   10d   v1.18.8-mirantis-1

To replace a failed controller node:

  1. Remove the Kubernetes labels from the failed node by editing the .metadata.labels node object:

    kubectl edit node <NODE-NAME>
    
  2. If your cluster is deployed with a compact control plane, inspect precautions for a cluster machine deletion.

  3. Add the control plane node to your deployment as described in Add a controller node.

  4. Identify all stateful applications present on the failed node:

    node=<NODE-NAME>
    claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
    for i in $claims; do echo $i; done
    

    Example of system response:

    mysql-data-mariadb-server-2
    openstack-operator-bind-mounts-rfr-openstack-redis-1
    etcd-data-etcd-etcd-0
    
  5. If the failed controller node had the StackLight label, fix the StackLight volume node affinity conflict as described in Delete a cluster machine.

  6. Remove the OpenStack port related to the Octavia health manager pod of the failed node:

    kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
    
  7. Strongly recommended. Back up MKE as described in Create backups of Mirantis Kubernetes Engine.

    Since the procedure above modifies the cluster configuration, a fresh backup is required to restore the cluster in case further reconfigurations fail.