Repair a cluster¶
For an MSR cluster to be healthy, a majority of its replicas (n/2 + 1) must be healthy and able to communicate with the other replicas. This is known as maintaining quorum. For example, a cluster with three replicas requires at least two healthy replicas to maintain quorum.
In a scenario where quorum is lost but at least one replica is still accessible, you can use that replica to repair the cluster. The replica does not need to be completely healthy: because the MSR data volumes are persisted and remain accessible, the cluster can still be repaired from it.
Repairing the cluster from an existing replica minimizes the amount of data lost. If this procedure doesn’t work, you’ll have to restore from an existing backup.
Diagnose an unhealthy cluster¶
When a majority of replicas are unhealthy and the overall MSR cluster becomes unhealthy as a result, operations such as docker login, docker pull, and docker push return an internal server error. Accessing the /_ping endpoint of any replica returns the same error, and the MSR web UI may be partially or fully unresponsive.
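To check whether a particular replica is responding, you can query its /_ping endpoint directly. The following is a minimal example; the host name is a placeholder that you should replace with the address of one of your MSR replicas:

curl -ks https://<msr-replica-host>/_ping

A healthy replica responds successfully, whereas a cluster that has lost quorum returns the internal server error described above.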
Running the msr db scale command returns an error such as the following:
{"level":"fatal","msg":"unable to reconfigure replication: unable to
reconfigure replication for table \"org_membership\": unable to
reconfigure database replication: rethinkdb: The server(s) hosting table
`enzi.org_membership` are currently unreachable. The table was not
reconfigured. If you do not expect the server(s) to recover, you can use
`emergency_repair` to restore availability of the table.
\u003chttp://rethinkdb.com/api/javascript/reconfigure/#emergency-repair-mode\u003e
in:\nr.DB(\"enzi\").Table(\"org_membership\").Reconfigure(replicas=1, shards=1)","time":"2022-12-09T20:13:47Z"}
command terminated with exit code 1
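For reference, the msr db scale command is run from the msr-api Deployment in the same way as the emergency-repair command shown below. The invocation here is a sketch only; depending on your MSR version, additional arguments such as a target replica count may be required:

kubectl exec deploy/msr-api -- msr db scale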
Perform an emergency repair¶
Use the msr db emergency-repair command to repair an unhealthy MSR cluster from the msr-api Deployment. This command overrides the standard safety checks that occur when scaling a RethinkDB cluster, which allows RethinkDB to modify the replication factor to the setting most appropriate for the number of rethinkdb-cluster Pods that are connected to the database.
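Because the repair sets the replication factor according to the number of connected rethinkdb-cluster Pods, it can be useful to confirm how many of those Pods are running before you repair. A simple check, assuming the Pod names contain rethinkdb-cluster (adjust the filter to match your deployment):

kubectl get pods | grep rethinkdb-cluster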
The msr db emergency-repair command is commonly used when the msr db scale command is no longer able to reliably scale the database. This typically occurs after a prior loss of quorum, which often happens when you change rethinkdb.cluster.replicaCount without first decommissioning and scaling down the RethinkDB servers. For more information on scaling down RethinkDB servers, refer to Remove replicas from RethinkDB.
Run the following command to perform an emergency repair:
kubectl exec deploy/msr-api -- msr db emergency-repair
Alternatively, for deployments managed with the MSR installer, specify the number of replicas in the values.yml file and run:
docker run -v $(pwd)/values.yml:/config/values.yml -v /var/run/docker.sock:/var/run/docker.sock -it msr-installer apply