Replace a failed TF controller node

If one of the Tungsten Fabric (TF) controller nodes has failed, follow this procedure to replace it with a new node.

To replace a TF controller node:

Note

Pods that belong to the failed node can stay in the Terminating state.

  1. If the failed node has the tfconfigdb=enabled or tfanalyticsdb=enabled label, or both, assigned to it, obtain and note down the IP addresses of the Cassandra pods that run on the node to be replaced:

    kubectl -n tf get pods -owide | grep 'tf-cassandra.*<FAILED-NODE-NAME>'
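
    To keep only the IP addresses for later use in step 7, you can, for example, append an awk filter (a sketch, assuming the default kubectl wide output, where the IP is the sixth column):

    # Print only the pod IP addresses of the Cassandra pods on the failed node
    kubectl -n tf get pods -owide | grep 'tf-cassandra.*<FAILED-NODE-NAME>' | awk '{print $6}'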
    
  2. Delete the failed TF controller node from the Kubernetes cluster using the Mirantis Container Cloud web UI or CLI. For the procedure, refer to Mirantis Container Cloud Operations Guide: Delete a cluster machine.

    Note

    Once the failed node has been removed from the cluster, all pods that were stuck in the Terminating state should be removed as well.

  3. Remove the control (with the BGP router), config, and config-database nodes from the TF configuration database using one of the following methods:

    Using the TF web UI:

    To remove control (with the BGP router) nodes:

    1. Log in to the TF web UI.

    2. Navigate to Configure > BGP Routers.

    3. Delete all terminated control nodes.

    To remove nodes of other types:

    1. Log in to the TF web UI.

    2. Navigate to Configure > Nodes.

    3. Delete the terminated nodes.

    Using the TF API:

    1. Log in to the Keystone client container:

      kubectl -n openstack exec -it deployment/keystone-client -- bash
      
    2. Obtain the link (href) of the failed TF config node:

      curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/config-nodes | jq
      
    3. Remove the failed config node:

      curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" -X "DELETE" <LINK_FROM_HREF_WITH_NODE_UUID>
      
    4. Obtain the list of config-database and control nodes:

      curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/config-database-nodes | jq
      curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/control-nodes | jq
      
    5. Identify the config-database and control nodes to be deleted using the href field in the output of the previous step, then delete them as required (a sketch follows at the end of this procedure):

      curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" -X "DELETE" <LINK_FROM_HREF_WITH_NODE_UUID>
      
    Using the tf-api-cli tool:

    1. Enable the tf-api-cli Deployment as described in Enable tf-api-cli.

    2. Log in to the tf-api-cli container:

      kubectl -n tf exec -it deployment/tf-tool-cli -- bash
      
    3. Obtain the list of the Tungsten Fabric config nodes and identify the one to be removed using the cat argument and checking the name field:

      tf-api-cli ls config-node
      tf-api-cli cat config-node/<UUID>
      
    4. Remove the config node and confirm the removal:

      tf-api-cli rm config-node/<UUID>
      
    5. Obtain the list of the Tungsten Fabric config-database nodes and identify the one to be removed using the cat argument and checking the name field:

      tf-api-cli ls config-database-node
      tf-api-cli cat config-database-node/<UUID>
      
    6. Remove the config-database node and confirm the removal:

      tf-api-cli rm config-database-node/<UUID>
      
    7. Obtain the list of the BGP routers and identify the one to be removed using the cat argument and checking the name field:

      tf-api-cli ls bgp-router
      tf-api-cli cat bgp-router/<UUID>
      
    8. Remove the Tungsten Fabric control node with the BGP router and confirm the removal:

      tf-api-cli rm bgp-router/<UUID>
      
  4. Assign the TF labels to the new control plane node according to the table below, using the following command (a hypothetical example follows the notes below):

    kubectl label node <NODE-NAME> <LABEL-KEY=LABEL-VALUE> ...
    
    Tungsten Fabric (TF) node roles:

    TF control plane
      Description: Hosts the TF control plane services such as database, messaging, api, svc, and config.
      Kubernetes labels: tfconfig=enabled, tfcontrol=enabled, tfwebui=enabled, tfconfigdb=enabled
      Minimal count: 3

    TF analytics
      Description: Hosts the TF analytics services.
      Kubernetes labels: tfanalytics=enabled, tfanalyticsdb=enabled
      Minimal count: 3

    TF vRouter
      Description: Hosts the TF vRouter module and vRouter Agent.
      Kubernetes labels: tfvrouter=enabled
      Minimal count: Varies

    TF vRouter DPDK (Technical Preview)
      Description: Hosts the TF vRouter Agent in DPDK mode.
      Kubernetes labels: tfvrouter-dpdk=enabled
      Minimal count: Varies

    Note

    TF supports only Kubernetes OpenStack workloads. Therefore, you should label OpenStack compute nodes with the tfvrouter=enabled label.

    Note

    Do not specify the openstack-gateway=enabled and openvswitch=enabled labels for the OpenStack deployments with TF as a networking backend.
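
    For example, assuming the replacement node takes over the TF control plane role, assign it the full control plane label set from the table above (a sketch; substitute the actual node name for the <NEW-NODE-NAME> placeholder):

    kubectl label node <NEW-NODE-NAME> tfconfig=enabled tfcontrol=enabled tfwebui=enabled tfconfigdb=enabled

    If the failed node also carried the analytics labels, add tfanalytics=enabled and tfanalyticsdb=enabled in the same way.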

  5. Once you label the new Kubernetes node, new pods start scheduling on it. However, pods that use Persistent Volume Claims remain stuck in the Pending state because their volume claims are still bound to the local volumes from the deleted node. To resolve the issue:

    1. Delete the PersistentVolumeClaim (PVC) bound to the local volume from the failed node:

      kubectl -n tf delete pvc <PVC-BOUNDED-TO-NON-EXISTING-VOLUME>
      

      Note

      Clustered services that use PVCs, such as Cassandra, Kafka, and ZooKeeper, start the replication process when new pods move to the Ready state.

    2. Check the PersistentVolumes (PVs) claimed by the deleted PVCs. If a PV is stuck in the Released state, delete it manually (see the sketch after this list):

      kubectl -n tf delete pv <PV>
      
    3. Delete the pod that is using the removed PVC:

      kubectl -n tf delete pod <POD-NAME>
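
    For sub-step 2 above, one way to find a released PV is to filter on the CLAIM column of kubectl get pv, which is printed as <namespace>/<PVC name> (a sketch; <DELETED-PVC-NAME> stands for the PVC deleted in sub-step 1):

      # Find the local PV still referenced by the deleted PVC and stuck in the Released state
      kubectl get pv | grep 'tf/<DELETED-PVC-NAME>' | grep Released
      # Delete the PV by the name printed in the first column
      kubectl delete pv <PV-NAME>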
      
  6. Verify that the pods have successfully started on the replaced controller node and stay in the Ready state.
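
    For example, you can list the TF pods scheduled on the replaced node with a standard field selector (a sketch; substitute the actual node name for <NEW-NODE-NAME>):

    kubectl -n tf get pods -o wide --field-selector spec.nodeName=<NEW-NODE-NAME>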

  7. If the failed controller node had the tfconfigdb=enabled or tfanalyticsdb=enabled label, or both, assigned to it, remove the old Cassandra hosts from the config and analytics cluster configuration (a hypothetical end-to-end example follows these steps):

    1. Get the host ID of the removed Cassandra host using the pod IP addresses saved during Step 1:

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool status
      
    2. Verify that the removed Cassandra node has the DN status, which indicates that the node is currently offline.

    3. Remove the failed Cassandra host:

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool removenode <HOST-ID>
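
    The following is a hypothetical end-to-end example for the config cluster, assuming the pod IP saved in step 1 was 192.168.1.15:

      # Match the saved pod IP to its Cassandra host ID; a line starting with "DN" confirms the node is down
      kubectl -n tf exec tf-cassandra-config-dc1-rack1-1 -c cassandra -- nodetool status | grep 192.168.1.15
      # Remove the node using the value from the Host ID column of that line
      kubectl -n tf exec tf-cassandra-config-dc1-rack1-1 -c cassandra -- nodetool removenode <HOST-ID>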