Repair a cluster

For an MSR cluster to be healthy, a majority of its replicas (n/2 + 1) must be healthy and able to communicate with the other replicas. This is known as maintaining quorum.

In a scenario where quorum is lost but at least one replica is still accessible, you can use that replica to repair the cluster. The replica doesn’t need to be completely healthy; as long as its MSR data volumes are persisted and accessible, the cluster can still be repaired.

Figure: Unhealthy cluster

Repairing the cluster from an existing replica minimizes the amount of data lost. If this procedure doesn’t work, you’ll have to restore from an existing backup.

Diagnose an unhealthy cluster

When a majority of replicas are unhealthy and the overall MSR cluster becomes unhealthy, operations such as docker login, docker pull, and docker push return an internal server error.

Accessing the /_ping endpoint of any replica returns the same error, and the MSR web UI may be partially or fully unresponsive.
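For example, you can query the /_ping endpoint directly (a minimal sketch; <msr-url> is a placeholder for the address of an MSR replica, and you can drop -k if your TLS certificates are trusted):

curl -k https://<msr-url>/_ping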

Running the msr db scale command returns an error such as the following:

{"level":"fatal","msg":"unable to reconfigure replication: unable to
reconfigure replication for table \"org_membership\": unable to
reconfigure database replication: rethinkdb: The server(s) hosting table
`enzi.org_membership` are currently unreachable. The table was not
reconfigured. If you do not expect the server(s) to recover, you can use
`emergency_repair` to restore availability of the table.
\u003chttp://rethinkdb.com/api/javascript/reconfigure/#emergency-repair-mode\u003e
in:\nr.DB(\"enzi\").Table(\"org_membership\").Reconfigure(replicas=1, shards=1)","time":"2022-12-09T20:13:47Z"}
command terminated with exit code 1
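For reference, the scale command is typically run from the msr-api Deployment, in the same way as the repair command shown later in this section (a sketch; any additional options depend on your MSR version):

kubectl exec deploy/msr-api -- msr db scale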

Perform an emergency repair

Use the msr db emergency-repair command to repair an unhealthy MSR cluster from the msr-api Deployment.

This command overrides the standard safety checks that occur when scaling a RethinkDB cluster, allowing RethinkDB to adjust the replication factor to match the number of rethinkdb-cluster Pods that are connected to the database.

The msr db emergency-repair command is commonly used when the msr db scale command can no longer reliably scale the database. This typically occurs after a prior loss of quorum, which often happens when you reduce rethinkdb.cluster.replicaCount without first decommissioning and scaling down the RethinkDB servers. For more information on scaling down RethinkDB servers, refer to Remove replicas from RethinkDB.
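Because the repair bases the new replication factor on the rethinkdb-cluster Pods that can reach the database, it can help to first confirm how many of those Pods are running (a sketch; exact Pod names depend on your Helm release):

kubectl get pods | grep rethinkdb-cluster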

Run the following command to perform an emergency repair:

kubectl exec deploy/msr-api -- msr db emergency-repair
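Once the repair completes, you can confirm that the replicas are reachable again, for example by re-checking the /_ping endpoint (a sketch; <msr-url> is a placeholder for your MSR address) and by verifying that docker login, docker pull, and docker push operations succeed:

curl -k https://<msr-url>/_ping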