Repair a single replica

When one or more MSR replicas are unhealthy but the overall majority (n/2 + 1) is healthy and able to communicate with one another, your MSR cluster is still functional and healthy.

Figure: Cluster with two nodes unhealthy

Given that the MSR cluster is healthy, there is no need to execute a disaster recovery procedure, such as restoring from a backup. Instead, you should:

  1. Remove the unhealthy replicas from the MSR cluster.

  2. Join new replicas to make MSR highly available.

The order in which you perform these operations is important, as an MSR cluster requires a majority of replicas to be healthy at all times. If you join more replicas before removing the ones that are unhealthy, your MSR cluster might become unhealthy.

Split-brain scenario

To understand why you should remove unhealthy replicas before joining new ones, imagine you have a five-replica MSR deployment and something goes wrong with the overlay network connecting the replicas, causing them to be separated into two groups.

Figure: Cluster with a network problem

Because the cluster originally had five replicas, it can work as long as three replicas are still healthy and able to communicate (5 / 2 + 1 = 3). Even though the network separated the replicas into two groups, MSR is still healthy.

If at this point you join a new replica instead of fixing the network problem or removing the two replicas that were isolated from the rest, the new replica may end up on the side of the network partition that has fewer replicas.

Figure: Cluster with split brain

When this happens, both groups have the minimum number of replicas needed to establish a cluster. This is known as a split-brain scenario: both groups can accept writes, and their histories begin to diverge, making the two groups effectively two different clusters.

Configure replicas

MSR on Swarm

To add or remove MSR on Swarm nodes, you must reconfigure the application with the list of nodes.

  1. Obtain the MSR on Swarm configuration file:

    docker run -it --rm --entrypoint \
    cat registry.mirantis.com/msr/msr-installer:3.1.0 \
    /config/values.yml > newvalues-swarm.yaml
    
  2. Edit the newvalues-swarm.yaml file and specify the worker nodes on which MSR is to be deployed (a filled-in example follows this procedure):

    swarm:
      ## nodeList is a comma-separated list of node IDs within the swarm that represent nodes that MSR will be allowed to
      ## deploy to. To retrieve a list of nodes within a swarm, execute `docker node ls`. If no nodes are specified, then MSR
      ## will be installed on the current node.
      ##
      nodeList:
    
  3. Update MSR so that the new configuration takes effect.
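
The following is a minimal sketch of a populated nodeList entry, assuming three worker nodes. The node IDs shown are placeholders; use the IDs reported by `docker node ls` in your own swarm, and keep whatever quoting convention your values.yml file already uses.

    swarm:
      ## Placeholder node IDs for illustration only; replace them with the IDs
      ## reported by `docker node ls`.
      nodeList: "x1a2b3c4d5e6,y7f8g9h0i1j2,z3k4l5m6n7o8"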

Scale Helm deployment

To scale your Helm deployment, you must first identify your MSR deployment:

kubectl get deployment

Next, run the following command to add or remove replicas from your MSR deployment:

kubectl scale deployment --replicas=3 <deployment-name>

Example:

kubectl scale deployment --replicas=3 msr-api
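
To confirm the result, check the ready replica count for the deployment. Assuming the deployment is named msr-api, as in the example above:

kubectl get deployment msr-api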

For comprehensive information on how to scale MSR on Helm up and down as a Kubernetes application, refer to Running Multiple Instances of Your App in the Kubernetes documentation.

MSR Operator

MSR Operator uses its own lifecycle manager, and thus the number of replicas is controlled by the MSR CRD manifest.

To increase or decrease the number of replicas for MSR Operator, adjust the replicaCount parameters in the _v1_msr.yaml manifest file. Once you reapply the CRD manifest, MSR spawns the required number of replicas.
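
The following is a minimal sketch of such an adjustment, assuming a manifest in which each MSR component has its own replicaCount field; the component names and nesting shown here are illustrative, so match them to the structure of the _v1_msr.yaml file you downloaded.

    spec:
      ## Illustrative component sections; use the names present in your manifest.
      nginx:
        replicaCount: 3
      api:
        replicaCount: 3

After saving the change, reapply the manifest so that the operator reconciles the new replica count:

    kubectl apply -f _v1_msr.yaml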

Where to go next