If one of the Tungsten Fabric (TF) controller nodes has failed, follow this
procedure to replace it with a new node.
To replace a TF controller node:
Note
Pods that belong to the failed node may remain in the Terminating
state.
If the failed node has the tfconfigdb=enabled or tfanalyticsdb=enabled
label, or both, assigned to it, obtain and note down the IP addresses of
the Cassandra pods that run on the node to be replaced.
Obtain the list of config nodes:
curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/config-nodes | jq
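The `$(openstack token issue | awk '/ id / {print $4}')` substitution in the curl commands pulls the token value out of the table that `openstack token issue` prints. A minimal illustration with fabricated sample output (the token value below is made up):

```shell
# Fabricated "openstack token issue" table output for illustration.
sample='| expires | 2024-01-01T00:00:00+0000 |
| id      | gAAAAABtoken123          |
| user_id | 42                       |'
# The awk filter matches the row whose second column is "id" (note the
# surrounding spaces, which skip "user_id") and prints the fourth
# whitespace-separated field, i.e. the token value.
printf '%s\n' "$sample" | awk '/ id / {print $4}'
```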
Remove the failed config node:
curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" -X "DELETE" <LINK_FROM_HREF_WITH_NODE_UUID>
Obtain the lists of config-database and control nodes:
curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/config-database-nodes | jq
curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" http://tf-config-api.tf.svc:8082/control-nodes | jq
Identify the config-database and control nodes to be deleted
using the href field from the output of the previous step.
Delete the nodes as required:
curl -s -H "X-Auth-Token: $(openstack token issue | awk '/ id / {print $4}')" -X "DELETE" <LINK_FROM_HREF_WITH_NODE_UUID>
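Picking the right href out of the JSON listings can be scripted. A minimal sketch, using a fabricated /control-nodes response and assuming python3 is available, that prints the href of the entry whose hostname matches the failed node:

```shell
# Fabricated /control-nodes response for illustration; real responses
# come from the tf-config-api service.
response='{"control-nodes": [
  {"href": "http://tf-config-api.tf.svc:8082/control-node/aaa-111",
   "fq_name": ["default-global-system-config", "ctl01"]},
  {"href": "http://tf-config-api.tf.svc:8082/control-node/bbb-222",
   "fq_name": ["default-global-system-config", "ctl02"]}
]}'
failed=ctl02   # hypothetical hostname of the failed node
# Print the href of the entry whose last fq_name element is the failed host.
printf '%s' "$response" | python3 -c '
import json, sys
failed = sys.argv[1]
for node in json.load(sys.stdin)["control-nodes"]:
    if node["fq_name"][-1] == failed:
        print(node["href"])
' "$failed"
```

The printed href is the value to substitute for <LINK_FROM_HREF_WITH_NODE_UUID> in the DELETE request.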
Label the new Kubernetes node according to the role of the replaced node:

TF control plane
  Hosts the TF control plane services such as database, messaging, api, svc, and config.
  Labels: tfconfig=enabled, tfcontrol=enabled, tfwebui=enabled, tfconfigdb=enabled
  Nodes: 3

TF analytics
  Hosts the TF analytics services.
  Labels: tfanalytics=enabled, tfanalyticsdb=enabled
  Nodes: 3

TF vRouter
  Hosts the TF vRouter module and the vRouter Agent.
  Labels: tfvrouter=enabled
  Nodes: Varies

TF vRouter DPDK (Technical Preview)
  Hosts the TF vRouter Agent in DPDK mode.
  Labels: tfvrouter-dpdk=enabled
  Nodes: Varies
Note
TF supports only Kubernetes OpenStack workloads.
Therefore, you should label OpenStack compute nodes with
the tfvrouter=enabled label.
Note
Do not specify the openstack-gateway=enabled
and openvswitch=enabled labels for OpenStack deployments with TF
as a networking backend.
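The labels above are assigned with kubectl label. A minimal dry-run sketch for a TF control plane node that only prints the command for a hypothetical replacement node name (run the printed command against a real cluster):

```shell
# Hypothetical replacement node name; substitute your own.
NODE=new-tf-ctl01
# Build the label command for a TF control plane node (printed rather
# than executed so the sketch runs without cluster access).
CMD="kubectl label node $NODE tfconfig=enabled tfcontrol=enabled tfwebui=enabled tfconfigdb=enabled"
echo "$CMD"
```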
Once you label the new Kubernetes node, new pods start scheduling on the
node. However, pods that use Persistent Volume Claims are stuck in the
Pending state because their volume claims remain bound to the local volumes
from the deleted node. To resolve the issue:
Delete the PersistentVolumeClaim (PVC) bound to the local volume
from the failed node:
Clustered services that use PVC, such as Cassandra, Kafka,
and ZooKeeper, start the replication process when new pods move
to the Ready state.
Check the PersistentVolumes (PVs) claimed by the deleted PVCs.
If a PV is stuck in the Released state, delete it manually:
kubectl -n tf delete pv <PV>
Delete the pod that uses the removed PVC:
kubectl -n tf delete pod <POD-NAME>
Verify that the pods have successfully started on the replaced controller
node and stay in the Ready state.
If the failed controller node had the tfconfigdb=enabled or
tfanalyticsdb=enabled label, or both, assigned to it, remove the old
Cassandra hosts from the config and analytics cluster configuration:
Get the host ID of the removed Cassandra host using the pod IP addresses
saved during Step 1:
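Cassandra's nodetool status output maps each node's IP address to its host ID. A minimal sketch of extracting the host ID for one of the saved pod IPs (the status output below is fabricated; on a live cluster, run nodetool status inside a Cassandra pod instead):

```shell
# Fabricated "nodetool status" output for illustration.
status='Datacenter: dc1
==============
--  Address    Load     Tokens  Owns  Host ID                               Rack
UN  10.0.0.12  1.2 MiB  256     ?     11111111-2222-3333-4444-555555555555  rack1
DN  10.0.0.13  1.1 MiB  256     ?     66666666-7777-8888-9999-000000000000  rack1'
SAVED_IP=10.0.0.13   # one of the Cassandra pod IPs noted in Step 1
# Print the Host ID (field 7) of the row whose Address column (field 2)
# matches the saved pod IP. "Load" spans two fields (value and unit).
printf '%s\n' "$status" | awk -v ip="$SAVED_IP" '$2 == ip {print $7}'
```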