Node leaves the cluster after IP address change¶
A vSphere-based management cluster bootstrap fails due to a node leaving the cluster after an accidental IP address change.
The issue affects a vSphere-based cluster only when IPAM is not enabled and IP address assignment to the vSphere virtual machines is performed by a DHCP server present in the vSphere network.
By default, a DHCP server keeps the lease of an IP address for 30 minutes.
Usually, dhclient on a VM prolongs such a lease by sending frequent DHCP requests to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time on the DHCP server, so prolongation usually works.
However, in case of network issues, for example, when dhclient on the VM cannot reach the DHCP server, or when the VM takes longer than the lease time to power on, the VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after the IP address of a VM has changed. Therefore, such an issue may lead to the VM leaving the cluster.
One of the nodes is in an unhealthy state (for example, NotReady in the kubectl output or Down in the docker node ls output):
kubectl get nodes -o wide
docker node ls
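As a quick filter for this check, the following minimal sketch lists the nodes whose Kubernetes status is not Ready. The helper name not_ready_nodes is hypothetical, and it relies only on the standard column layout of kubectl get nodes --no-headers (node name in the first column, status in the second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# does not start with "Ready" (for example, NotReady).
# Reads the output of `kubectl get nodes --no-headers -o wide` on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example usage on a machine with access to the cluster (not executed here):
#   kubectl get nodes --no-headers -o wide | not_ready_nodes
```

A node reported as Ready,SchedulingDisabled is treated as healthy by this filter, because its status still starts with Ready.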
The UCP Swarm manager logs on the healthy manager node contain the following example error:
docker logs -f ucp-swarm-manager
level=debug msg="Engine refresh failed" id="<docker node ID>|<node IP>:12376"
If the affected node is a manager:
The output of the docker info command contains the following example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \ It's possible that too few managers are online. \ Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \ Swarm Mode Manager health check error: \ info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \ Is the docker daemon running?
On the affected node, the IP address of the first interface eth0 does not match the IP address configured in Docker. Verify the Node Address field in the output of the docker info command.
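The comparison described here can be sketched as a short script. It assumes that the ip utility from iproute2 is available on the node and reads the Swarm node address through the {{.Swarm.NodeAddr}} template field of docker info. The helper name iface_ipv4 is hypothetical.

```shell
# Hypothetical helper: extract the IPv4 address (without the prefix length)
# from the one-line output of `ip -4 -o addr show dev <iface>` read on stdin.
iface_ipv4() {
  awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Example usage on the affected node (not executed here):
#   current=$(ip -4 -o addr show dev eth0 | iface_ipv4)
#   docker_addr=$(docker info --format '{{.Swarm.NodeAddr}}')
#   [ "$current" = "$docker_addr" ] || \
#     echo "mismatch: eth0=$current docker=$docker_addr"
```

If the interface has several IPv4 addresses, this sketch reports only the first one, so compare the remaining addresses manually.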
Lines similar to the following are present in the node logs:
dhclient[<pid>]: bound to <node IP> -- renewal in 1530 seconds
If several such lines show different IP addresses, the node is affected.
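To avoid comparing the log lines by eye, the following minimal sketch collects the distinct addresses that dhclient reported as bound. The helper name distinct_bound_ips and the LOG_FILE placeholder are hypothetical. Point the placeholder at whichever log contains the dhclient messages on your system.

```shell
# Hypothetical helper: list the distinct IPv4 addresses that dhclient
# reported as bound, reading log lines on stdin.
distinct_bound_ips() {
  grep -oE 'bound to [0-9]+(\.[0-9]+){3}' | awk '{ print $3 }' | sort -u
}

# Example usage (LOG_FILE is a placeholder, not a path from this document):
#   distinct_bound_ips < "$LOG_FILE" | wc -l
# A count greater than 1 means that the address changed at least once.
```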
Apply the issue resolution¶
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server for the dedicated vSphere network. In this case, VMs receive only specified IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server for the dedicated vSphere network and configure the first interface eth0 on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled for new machines to be created and IPs to be assigned properly.
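For the first option, the binding of an IP address to a MAC address depends on the DHCP server in use, which this document does not name. As an illustration only, the following sketch assumes an ISC dhcpd server and prints a host declaration for its configuration file. The helper name dhcpd_host_entry and the example values are hypothetical.

```shell
# Hypothetical helper: print an ISC dhcpd "host" declaration that pins
# an IP address to a MAC address. Arguments: <name> <MAC> <IP>.
# Assumption: the DHCP server is ISC dhcpd; other servers use other syntax.
dhcpd_host_entry() {
  printf 'host %s {\n  hardware ethernet %s;\n  fixed-address %s;\n}\n' "$1" "$2" "$3"
}

# Example with placeholder values:
#   dhcpd_host_entry node-1 00:50:56:aa:bb:cc 10.0.0.11
```

Generate one declaration per cluster machine, then review the result before adding it to the server configuration.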