Replace a failed TF controller node
If one of the Tungsten Fabric (TF) controller nodes has failed, follow this procedure to replace it with a new node.
To replace a TF controller node:
Note

Pods that belong to the failed node can stay in the Terminating state.

1. If the failed node has the tfconfigdb=enabled or tfanalyticsdb=enabled label, or both, assigned to it, obtain and note down the IP addresses of the Cassandra pods that run on the node to be replaced:

   kubectl -n tf get pods -o wide | grep 'tf-cassandra.*<FAILED-NODE-NAME>'
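   For convenience, you can save these IP addresses in a shell variable for later use. The following is a minimal sketch, assuming the pod IP appears in the sixth column of the kubectl -o wide output; the CASSANDRA_POD_IPS variable name is illustrative:

   # Save the IP addresses of the Cassandra pods that run on the failed node
   CASSANDRA_POD_IPS=$(kubectl -n tf get pods -o wide | grep 'tf-cassandra.*<FAILED-NODE-NAME>' | awk '{print $6}')
   echo "${CASSANDRA_POD_IPS}"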
2. Delete the failed TF controller node from the Kubernetes cluster:

   kubectl delete node <FAILED-TF-CONTROLLER-NODE-NAME>
   Note

   Once the failed node has been removed from the cluster, all pods that hung in the Terminating state should be removed.
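   You can confirm that the node was deleted and that no pods remain stuck in the Terminating state (the grep pattern is illustrative):

   # Verify that the failed node no longer appears in the cluster
   kubectl get nodes
   # Verify that no TF pods remain in the Terminating state
   kubectl -n tf get pods | grep Terminating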
3. Assign the TF labels to the new control plane node as per the table below using the following command:

   kubectl label node <NODE-NAME> <LABEL-KEY=LABEL-VALUE> ...
   Tungsten Fabric (TF) node roles

   | Node role | Description | Kubernetes labels | Minimal count |
   |---|---|---|---|
   | TF control plane | Hosts the TF control plane services such as database, messaging, api, svc, and config. | tfconfig=enabled, tfcontrol=enabled, tfwebui=enabled, tfconfigdb=enabled | 3 |
   | TF analytics | Hosts the TF analytics services. | tfanalytics=enabled, tfanalyticsdb=enabled | 3 |
   | TF vRouter | Hosts the TF vRouter module and vRouter Agent. | tfvrouter=enabled | Varies |
   | TF vRouter DPDK (Technical Preview) | Hosts the TF vRouter Agent in DPDK mode. | tfvrouter-dpdk=enabled | Varies |
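   For example, to assign the TF control plane role to the new node, apply all four control plane labels from the table in one command (<NEW-NODE-NAME> is a placeholder for the new node name):

   kubectl label node <NEW-NODE-NAME> tfconfig=enabled tfcontrol=enabled tfwebui=enabled tfconfigdb=enabled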
   Note

   TF supports only Kubernetes OpenStack workloads. Therefore, label the OpenStack compute nodes with the tfvrouter=enabled label.

   Note

   Do not specify the openstack-gateway=enabled and openvswitch=enabled labels for OpenStack deployments with TF as a networking back end.

4. Once you label the new Kubernetes node, new pods start scheduling on it. However, pods that use Persistent Volume Claims get stuck in the Pending state because their volume claims stay bound to the local volumes from the deleted node. To resolve the issue, complete the following substeps; the sketch after them shows how to identify the affected objects:

   1. Delete the PersistentVolumeClaim (PVC) bound to the local volume from the failed node:

      kubectl -n tf delete pvc <PVC-BOUNDED-TO-NON-EXISTING-VOLUME>
      Note

      Clustered services that use PVCs, such as Cassandra, Kafka, and ZooKeeper, start the replication process when the new pods move to the Ready state.

   2. Check the PersistentVolumes (PVs) claimed by the deleted PVCs. If a PV is stuck in the Released state, delete it manually:

      kubectl -n tf delete pv <PV>
   3. Delete the pod that is using the removed PVC:

      kubectl -n tf delete pod <POD-NAME>
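   To identify the objects affected in the substeps above, you can list the pods stuck in the Pending state and the volumes left over from the failed node. A minimal sketch using standard kubectl queries:

   # List TF pods stuck in the Pending state
   kubectl -n tf get pods --field-selector status.phase=Pending
   # List PVCs in the tf namespace together with their bound volumes
   kubectl -n tf get pvc
   # List PVs that remain in the Released state
   kubectl get pv | grep Released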
5. Verify that the pods have successfully started on the replaced controller node and stay in the Ready state.
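   For example, you can list the pods scheduled on the new node and check their readiness (the grep pattern is illustrative):

   kubectl -n tf get pods -o wide | grep <NEW-NODE-NAME>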
6. If the failed controller node had the tfconfigdb=enabled or tfanalyticsdb=enabled label, or both, assigned to it, remove the old Cassandra hosts from the config and analytics cluster configuration:

   1. Get the host ID of the removed Cassandra host using the pod IP addresses saved during Step 1:

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool status
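      To locate the host ID quickly, you can filter the nodetool status output by one of the pod IP addresses saved during Step 1 (a sketch; <SAVED-POD-IP> is a placeholder):

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool status | grep <SAVED-POD-IP>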
      Verify that the removed Cassandra node has the DN status, which indicates that the node is currently offline.

   2. Remove the failed Cassandra host:

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool removenode <HOST-ID>
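      After the removal, you can rerun nodetool status and verify that the removed host ID no longer appears in the output:

      kubectl -n tf exec tf-cassandra-<config/analytics>-dc1-rack1-1 -c cassandra -- nodetool status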
7. Delete the terminated nodes from the TF configuration through the TF web UI:

   1. Log in to the TF web UI.
   2. Navigate to Configure > BGP Routers.
   3. Delete all terminated control nodes.
   Note

   You can manage nodes of other types from Configure > Nodes.