Repair a single replica

When one or more MSR replicas are unhealthy but a majority of the replicas (n/2 + 1, using integer division) is healthy and able to communicate with one another, your MSR cluster is still functional and healthy.
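As a rough illustration of the majority rule above, the following sketch computes the majority for a few cluster sizes. The function names `majority` and `tolerated_failures` are hypothetical helpers written for this example; they are not part of MSR or kubectl.

```shell
# majority N: smallest number of replicas that forms a majority of N.
# Shell arithmetic uses integer division, so 5 / 2 evaluates to 2.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: replicas that can be unhealthy while a majority remains.
tolerated_failures() {
  echo $(( $1 - $(majority "$1") ))
}

for n in 3 5 7; do
  echo "replicas=$n majority=$(majority "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

For a five-replica cluster this gives a majority of three, so up to two replicas can be unhealthy before the cluster loses its majority.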

Cluster with two nodes unhealthy

Given that the MSR cluster is healthy, there is no need to execute a disaster recovery procedure, such as restoring from a backup. Instead, you should:

  1. Remove the unhealthy replicas from the MSR cluster.

  2. Join new replicas to make MSR highly available.

The order in which you perform these operations is important, as an MSR cluster requires a majority of replicas to be healthy at all times. If you join more replicas before removing the ones that are unhealthy, your MSR cluster might become unhealthy.

Split-brain scenario

To understand why you should remove unhealthy replicas before joining new ones, imagine you have a five-replica MSR deployment, and something goes wrong with the overlay network connecting the replicas, causing them to be separated into two groups.

Cluster with network problem

Because the cluster originally had five replicas, it can work as long as three replicas are still healthy and able to communicate (5 / 2 + 1 = 3, using integer division). Even though the network separated the replicas into two groups, MSR is still healthy.

If at this point you join a new replica instead of fixing the network problem or removing the two replicas that were isolated from the rest, the new replica might end up on the side of the network partition that has fewer replicas.

Cluster with split brain

When this happens, both groups have the minimum number of replicas needed to establish a cluster. This is known as a split-brain scenario, because both groups can now accept writes and their histories start to diverge, making the two groups effectively two different clusters.
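The scenario above can be sketched with a simplified model in which each side of the partition compares its own size against the majority of the original five replicas. The `has_quorum` function and the variable names are hypothetical and exist only for this illustration; they do not reflect how MSR evaluates cluster health internally.

```shell
# has_quorum GROUP_SIZE CLUSTER_SIZE: succeeds when the group alone holds a majority.
has_quorum() {
  [ "$1" -ge $(( $2 / 2 + 1 )) ]
}

original_size=5   # replicas before the partition
side_a=3          # replicas that can still reach one another
side_b=2          # replicas isolated by the network problem

# Before joining a replica, only side A reaches the majority of three.
has_quorum "$side_a" "$original_size" && echo "side A: quorum"
has_quorum "$side_b" "$original_size" || echo "side B: no quorum"

# A new replica lands on the smaller side of the partition.
side_b=$(( side_b + 1 ))
if has_quorum "$side_a" "$original_size" && has_quorum "$side_b" "$original_size"; then
  echo "split brain: both sides can accept writes"
fi
```

Removing the two isolated replicas first avoids this outcome, because the smaller side can then never grow to a majority.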

Scale Helm deployment

Important

With MSR 3.0 you can configure the number of replicas; however, you cannot add or remove individual replicas.

To scale your Helm deployment, first list the deployments to obtain the name of your MSR deployment:

kubectl get deployment

Next, run the following command to add or remove replicas from your MSR deployment:

kubectl scale deployment --replicas=3 <deployment-name>

Example:

kubectl scale deployment --replicas=3 msr-api
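If you want to confirm that the new replica count has taken effect, one option is to follow the scale command with `kubectl rollout status`, which should wait until the deployment reports the requested replicas as available. The sketch below wraps both commands in a hypothetical helper named `scale_and_wait`; the helper and its input check are assumptions made for this example and are not part of MSR or kubectl.

```shell
# scale_and_wait DEPLOYMENT REPLICAS
# Hypothetical helper: scales a deployment, then waits for the rollout to finish.
scale_and_wait() {
  deployment=$1
  replicas=$2
  # Reject anything that is not a non-negative integer before calling kubectl.
  case $replicas in
    ''|*[!0-9]*)
      echo "replica count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  kubectl scale deployment --replicas="$replicas" "$deployment" &&
    kubectl rollout status "deployment/$deployment"
}

# Usage: scale_and_wait msr-api 3
```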

For comprehensive information on how to scale MSR on Helm up and down as a Kubernetes application, refer to the Kubernetes documentation, Running Multiple Instances of Your App.