Replace a failed controller node

Replace a failed controller node

This section describes how to replace a failed control plane node in your MOS deployment. The procedure applies to the control plane nodes that are, for example, permanently failed due to a hardware failure and appear in the NotReady state:

kubectl get nodes <CONTAINER-CLOUD-NODE-NAME>

Example of system response:

NAME                         STATUS       ROLES    AGE   VERSION
<CONTAINER-CLOUD-NODE-NAME>    NotReady   <none>   10d   v1.18.8-mirantis-1

To replace a failed controller node:

  1. Remove the Kubernetes labels from the failed node by editing the .metadata.labels node object:

    kubectl edit node <CONTAINER-CLOUD-NODE-NAME>
    
  2. Add the control plane node to your deployment as described in Add a controller node.

  3. Identify all stateful applications present on the failed node:

    node=<CONTAINER-CLOUD-NODE-NAME>
    claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
    for i in $claims; do echo $i; done
    

    Example of system response:

    mysql-data-mariadb-server-2
    openstack-operator-bind-mounts-rfr-openstack-redis-1
    etcd-data-etcd-etcd-0
    
  4. Reschedule stateful applications pods to healthy controller nodes as described in Reschedule stateful applications.

  5. Remove the OpenStack port related to the Octavia health manager pod of the failed node:

    kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> openstack port delete octavia-health-manager-listen-port-<NODE-NAME>