Repair a cluster

For an MSR cluster to be healthy, a majority of its replicas (n/2 + 1, with n/2 rounded down) must be healthy and able to communicate with the other replicas. This is known as maintaining quorum.
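As a quick illustration of the quorum arithmetic, the sketch below computes the majority for a given replica count using integer division. The function names quorum_size and has_quorum are hypothetical helpers written for this example; they are not part of any MSR tooling.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the MSR CLI.

# Print the number of healthy replicas needed for quorum: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed (exit status 0) if the healthy count meets quorum for the total count.
# Usage: has_quorum <total-replicas> <healthy-replicas>
has_quorum() {
  total=$1
  healthy=$2
  [ "$healthy" -ge "$(quorum_size "$total")" ]
}

quorum_size 3   # prints 2
quorum_size 5   # prints 3
```

For example, a five-replica cluster with only two healthy replicas has lost quorum, because it needs three.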

If quorum is lost but at least one replica is still accessible, you can use that replica to repair the cluster. That replica doesn’t need to be completely healthy. The cluster can still be repaired because the MSR data volumes are persisted and accessible.

Repairing the cluster from an existing replica minimizes the amount of data lost. If this procedure doesn’t work, you’ll have to restore from an existing backup.

Diagnose an unhealthy cluster

When a majority of replicas are unhealthy, the MSR cluster as a whole becomes unhealthy, and operations such as docker login, docker pull, and docker push return an internal server error.

Accessing the /_ping endpoint of any replica returns the same error. The MSR web UI may also be partially or fully unresponsive.
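One way to probe the /_ping endpoint from a terminal is sketched below. It assumes that a healthy replica answers with HTTP 200 and that the replica is reachable over HTTPS. The helper names classify_status and check_ping are hypothetical, and the host you pass in is a placeholder for your own replica address.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the MSR CLI.

# Map an HTTP status code to a coarse health verdict.
# Assumption: a healthy replica returns 200 from /_ping.
classify_status() {
  case "$1" in
    200) echo "healthy" ;;
    5??) echo "server error" ;;
    000) echo "unreachable" ;;   # curl reports 000 when no HTTP response was received
    *)   echo "unexpected status $1" ;;
  esac
}

# Query /_ping on one replica and print the verdict.
# -k skips TLS verification; use it only where certificates are self-signed.
check_ping() {
  code=$(curl -ks -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/_ping") || true
  classify_status "$code"
}

# Example (replace the placeholder with a replica address):
#   check_ping <msr-replica-host>
```

Running check_ping against each replica in turn gives a rough picture of how many replicas are still responding.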

Perform an emergency repair

Use the mirantis/dtr emergency-repair command to try to repair an unhealthy MSR cluster from an existing replica.

This command checks that the data volumes for the MSR replica are uncorrupted, redeploys all internal MSR components, and reconfigures them to use the existing volumes. It also reconfigures MSR by removing all other nodes from the cluster, leaving MSR as a single-replica cluster with the replica you chose.

Start by finding the ID of the MSR replica that you want to repair from. You can find the list of replicas by navigating to Shared Resources > Stacks or Swarm > Volumes (when using swarm mode) on the MKE web interface, or by using an MKE client bundle to run:

docker ps --format "{{.Names}}" | grep dtr

# The list of MSR containers with <node>/<component>-<replicaID>, e.g.
# node-1/dtr-api-a1640e1c15b6
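The listing above repeats the same replica ID once per component. As a minimal sketch, the helpers below reduce such a listing to unique replica IDs by taking the text after the last hyphen of each name. The function names are hypothetical, and the approach assumes that replica IDs themselves contain no hyphens, as in the sample name above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the MSR CLI.

# Print the replica ID from a name of the form <node>/<component>-<replicaID>.
# Assumption: the replica ID is everything after the last hyphen.
replica_id_from_name() {
  echo "${1##*-}"
}

# Read container names on stdin, one per line, and print the unique replica IDs.
unique_replica_ids() {
  while IFS= read -r name; do
    replica_id_from_name "$name"
  done | sort -u
}

# Example with sample names in the format shown above:
printf '%s\n' node-1/dtr-api-a1640e1c15b6 node-1/dtr-nginx-a1640e1c15b6 | unique_replica_ids
# prints a1640e1c15b6
```

With a client bundle loaded, the docker ps command shown above could be piped into unique_replica_ids in place of the printf sample.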

Another way to determine the replica ID is to SSH into an MSR node and run the following:

REPLICA_ID=$(docker inspect -f '{{.Name}}' $(docker ps -q -f name=dtr-rethink) | cut -f 3 -d '-') \
  && echo "$REPLICA_ID"

Then, use your MKE client bundle to run the emergency repair command:

docker run -it --rm mirantis/dtr:2.9.16 emergency-repair \
  --ucp-insecure-tls \
  --existing-replica-id <replica-id>

If the emergency repair procedure is successful, your MSR cluster now has a single replica. You should now join more replicas for high availability.

Note

Learn more about the high availability configuration in Set up high availability.

If the emergency repair command fails, try running it again using a different replica ID. As a last resort, you can restore your cluster from an existing backup.
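The retry advice can be sketched as a small loop that tries each candidate replica ID in turn and stops at the first success. The helper names try_replicas and run_emergency_repair are hypothetical; the wrapper simply reuses the emergency-repair invocation shown earlier, and the replica IDs in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the MSR CLI.

# try_replicas <command> <replica-id>...
# Run <command> with each replica ID until one invocation succeeds.
try_replicas() {
  cmd=$1
  shift
  for id in "$@"; do
    if "$cmd" "$id"; then
      echo "repaired from replica $id"
      return 0
    fi
    echo "repair from replica $id failed" >&2
  done
  return 1   # every candidate failed; restoring from backup is the remaining option
}

# Wrapper around the emergency-repair command shown earlier in this section.
run_emergency_repair() {
  docker run -it --rm mirantis/dtr:2.9.16 emergency-repair \
    --ucp-insecure-tls \
    --existing-replica-id "$1"
}

# Example (replace the placeholders with real replica IDs):
#   try_replicas run_emergency_repair <replica-id-1> <replica-id-2>
```

Because the repair command is interactive, you may prefer to run each attempt by hand; the loop only makes the order of attempts explicit.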

Where to go next