Recover from losing the quorum

Swarms are resilient to failures and can recover from temporary node failures, such as machine reboots or crashes followed by a restart, and from other transient errors. However, if a swarm loses quorum, it cannot automatically recover. In that case, tasks on existing worker nodes continue to run, but it is not possible to perform administrative tasks, such as scaling or updating services and joining nodes to or removing nodes from the swarm. The best way to recover after losing quorum is to bring the missing manager nodes back online. If that is not possible, follow the instructions below.

In a swarm of N managers, a majority (quorum) of manager nodes must always be available. For example, in a swarm with 5 managers, a minimum of 3 managers must be operational and in communication with each other. In other words, the swarm can tolerate up to (N-1)/2 permanent failures, and beyond that, requests involving swarm management cannot be processed. Such permanent failures include data corruption and hardware failure.
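If you still have access to a functioning manager, you can check how many managers the swarm has and whether each one is reachable before deciding how to recover. The following command is a quick sketch; the manager status values come from the standard docker node ls listing.

    # List manager nodes; the MANAGER STATUS column shows
    # Leader, Reachable, or Unreachable for each one:
    docker node ls --filter role=manager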

If you lose the quorum of managers, you cannot administer the swarm. If you attempt to perform any management operation after the quorum is lost, MKE issues the following error:

    Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

If you cannot recover quorum by bringing the failed manager nodes back online, you must run the docker swarm init command with the --force-new-cluster flag from a surviving manager node. Using this flag removes all managers except the manager from which you run the command.

To recover from losing quorum:

  1. Run docker swarm init with the --force-new-cluster flag from the manager node you want to recover:

    docker swarm init --force-new-cluster --advertise-addr node01:2377
    
  2. Promote nodes to become managers until you have the required number of manager nodes, for example, by using the docker node promote command as shown below.
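The following is a minimal sketch of step 2, assuming that worker nodes named node02 and node03 are still part of the swarm and that you want to return to a three-manager configuration; substitute your own node names.

    # Promote two former workers so that the swarm again has three managers
    # (node02 and node03 are placeholder node names):
    docker node promote node02 node03

    # Confirm that the new managers appear as Reachable:
    docker node ls --filter role=manager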

The Mirantis Container Runtime where you run the command becomes the manager node of a single-node swarm, which is capable of managing and running services. The manager has all the previous information about services and tasks, worker nodes continue to be part of the swarm, and services continue running. You need to add or re-add manager nodes to achieve your previous task distribution and ensure that you have enough managers to maintain high availability and prevent losing the quorum.
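If the failed managers cannot be repaired and you prefer to add new machines as managers rather than promote existing workers, a sketch of that procedure follows; the token and address shown are placeholders, and node01:2377 simply reuses the advertise address from the example above.

    # On the recovered manager, print the join command for new managers:
    docker swarm join-token manager

    # On each new machine, run the command that was printed, for example:
    docker swarm join --token <manager-token> node01:2377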