Replace a failed controller node

This section describes how to replace a failed control plane node in your MOS deployment. The procedure applies to control plane nodes that have failed permanently, for example, because of a hardware failure, and that appear in the NotReady state:

kubectl get nodes <CONTAINER-CLOUD-NODE-NAME>

Example of system response:

NAME                          STATUS     ROLES    AGE   VERSION
<CONTAINER-CLOUD-NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1
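
If you do not yet know which node has failed, you can filter the full node list by the STATUS column. The following is a minimal sketch rather than part of the documented procedure. The helper name not_ready_nodes is hypothetical, and the filter assumes the default kubectl get nodes column layout shown above, with the status in the second column.

```shell
# Print the names of nodes whose STATUS starts with "NotReady".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# Assumption: default column layout, NAME in column 1 and STATUS in column 2.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

The prefix match also catches cordoned nodes, whose status is reported as NotReady,SchedulingDisabled.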

To replace a failed controller node:

  1. Remove the Kubernetes labels from the failed node by editing the .metadata.labels field of the node object:

    kubectl edit node <CONTAINER-CLOUD-NODE-NAME>
    
  2. Add the control plane node to your deployment as described in Add a controller node.

  3. Identify all stateful applications present on the failed node:

    node=<CONTAINER-CLOUD-NODE-NAME>
    claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
    for i in $claims; do echo "$i"; done
    

    Example of system response:

    mysql-data-mariadb-server-2
    openstack-operator-bind-mounts-rfr-openstack-redis-1
    etcd-data-etcd-etcd-0
    
  4. Reschedule the stateful application pods to healthy controller nodes as described in Reschedule stateful applications.

  5. If the failed controller node had the StackLight label, fix the StackLight volume node affinity conflict as described in Mirantis Container Cloud Operations Guide: Delete a machine.

  6. Remove the OpenStack port related to the Octavia health manager pod of the failed node:

    kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> -- openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
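
As a non-interactive alternative to kubectl edit in step 1, you can remove labels with kubectl label by appending a hyphen to each label key. The following is a minimal sketch, not part of the documented procedure. The helper name unlabel_cmd and the label keys in the usage comment are hypothetical. The helper only prints the command so that you can review it before running it.

```shell
# Print (do not run) a kubectl command that removes the given label keys
# from a node. A trailing "-" after a key tells kubectl to delete that label.
# unlabel_cmd is a hypothetical helper name used for illustration only.
unlabel_cmd() {
  node=$1
  shift
  cmd="kubectl label node ${node}"
  for key in "$@"; do
    cmd="${cmd} ${key}-"
  done
  printf '%s\n' "$cmd"
}

# Usage (the label keys are placeholders; list the real ones with
# `kubectl get node <CONTAINER-CLOUD-NODE-NAME> --show-labels`):
#   unlabel_cmd <CONTAINER-CLOUD-NODE-NAME> <LABEL-KEY-1> <LABEL-KEY-2>
```

Printing the command instead of executing it keeps the sketch safe to try, because removing the wrong label from a node can affect pod scheduling.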