Repair a single replica¶

When one or more MSR replicas are unhealthy but the overall majority (n/2 + 1) is healthy and able to communicate with one another, your MSR cluster is still functional and healthy.

Given that the MSR cluster is healthy, there’s no need to execute any disaster recovery procedures like restoring from a backup.

Instead, you should:

Remove the unhealthy replicas from the MSR cluster.
Join new replicas to make MSR highly available.

Since a MSR cluster requires a majority of replicas to be healthy at all times, the order of these operations is important. If you join more replicas before removing the ones that are unhealthy, your MSR cluster might become unhealthy.

Split-brain scenario¶

To understand why you should remove unhealthy replicas before joining new ones, imagine you have a five-replica MSR deployment, and something goes wrong with the overlay network connection the replicas, causing them to be separated in two groups.

Because the cluster originally had five replicas, it can work as long as three replicas are still healthy and able to communicate (5 / 2 + 1 = 3). Even though the network separated the replicas in two groups, MSR is still healthy.

If at this point you join a new replica instead of fixing the network problem or removing the two replicas that got isolated from the rest, it’s possible that the new replica ends up in the side of the network partition that has less replicas.

When this happens, both groups now have the minimum amount of replicas needed to establish a cluster. This is also known as a split-brain scenario, because both groups can now accept writes and their histories start diverging, making the two groups effectively two different clusters.

Remove replicas¶

To remove unhealthy replicas, you’ll first have to find the replica ID of one of the replicas you want to keep, and the replica IDs of the unhealthy replicas you want to remove.

You can find the list of replicas by navigating to Shared Resources > Stacks or Swarm > Volumes (when using swarm mode) on the MKE web interface, or by using the MKE client bundle to run:

docker ps --format "{{.Names}}" | grep dtr

# The list of MSR containers with <node>/<component>-<replicaID>, e.g.
# node-1/dtr-api-a1640e1c15b6

Another way to determine the replica ID is to SSH into a MSR node and run the following:

REPLICA_ID=$(docker inspect -f '{{.Name}}' $(docker ps -q -f name=dtr-rethink) | cut -f 3 -d '-')
&& echo $REPLICA_ID

Then use the MKE client bundle to remove the unhealthy replicas:

docker run -it --rm mirantis/dtr:2.8.13 remove \
  --existing-replica-id <healthy-replica-id> \
  --replica-ids <unhealthy-replica-id> \
  --ucp-insecure-tls \
  --ucp-url <mke-url> \
  --ucp-username <user> \
  --ucp-password <password>

You can remove more than one replica at the same time, by specifying multiple IDs with a comma.

Join replicas¶

Once you’ve removed the unhealthy nodes from the cluster, you should join new ones to make sure your cluster is highly available.

Use your MKE client bundle to run the following command which prompts you for the necessary parameters:

docker run -it --rm \
  mirantis/dtr:2.8.13 join \
  --ucp-node <mke-node-name> \
  --ucp-insecure-tls

Where to go next¶

Disaster recovery overview