Restore a Galera cluster and database automatically


The Galera cluster ensures that the OpenStack services are operable. During a cluster outage, the number of manual steps required to start the cluster and to ensure the necessary access can significantly delay the restoration of services, and the procedure is prone to operator error. To reduce this complexity and support greater scalability, MCP provides an automated way to verify and restore the Galera cluster in your deployment.

This section describes how to verify the status of a Galera cluster and restore it using the Verify and Restore Galera cluster Jenkins pipeline. Use the automatic restoration procedure only if one Galera node is down or the data is corrupted. Otherwise, apply the manual procedure, adjusted to the needs of your deployment, as described in Restore a Galera cluster manually.

Note

This feature is available starting from the MCP 2019.2.5 maintenance update. Before enabling the feature, follow the steps described in Apply maintenance updates.

Note

The Verify and Restore Galera cluster Jenkins pipeline restores the Galera cluster with the provided configuration. It does not fix issues caused by cluster misconfiguration.

To restore the Galera cluster and database automatically:

  1. Log in to the Jenkins web UI.

  2. Open the Verify and Restore Galera cluster pipeline.

  3. Specify the required parameters:

    Parameter
      Description and values

    SALT_MASTER_URL
      Add the IP address of your Salt Master node host and the salt-api port. For example, http://172.18.170.27:6969.

    CREDENTIALS_ID
      Add credentials_id as credentials for the connection.

    RESTORE_TYPE
      Select ONLY_RESTORE if a manual backup has already been performed. That backup will be used during the restoration. Select BACKUP_AND_RESTORE if no backup has been performed and one must be created during the pipeline run.

    ASK_CONFIRMATION
      Set to False if you do not want the pipeline to wait for a manual confirmation before running the restoration. Defaults to True.

    CHECK_TIME_SYNC
      Set to False if you do not want the pipeline to verify the time synchronization across the nodes. Defaults to True.

    VERIFICATION_RETRIES
      Specify the number of times the verification is retried after the restoration. Increase the value for larger clusters, which may take more time to come up and synchronize. Defaults to 5.

  4. Click Deploy.

    The pipeline workflow:

    1. The verification stage:

      1. Obtaining and parsing the result of the mysql.status call.
      2. Formatting and printing a result report to the user.

      Example of a verification report:

       CLUSTER STATUS REPORT: 6 expected values, 0 warnings and 1 error found:
      
       [OK     ] Cluster status: Primary (Expected: Primary)
       [OK     ] Master node status: true (Expected: ON or true)
       [OK     ] Master node status comment: Synced (Expected: Joining or Waiting on SST
                 or Joined or Synced or Donor)
       [OK     ] Master node connectivity: true (Expected: ON or true)
       [OK     ] Average size of local reveived queue: 0.166667 (Expected: below 0.5)
                 (Value above 0 means that the node cannot apply write-sets as fast
                 as it receives them, which can lead to replication throttling)
       [OK     ] Average size of local send queue: 0.010204 (Expected: below 0.5)
                 (Value above 0 indicate replication throttling or network throughput
                 issues, such as a bottleneck on the network link.)
      
       [  ERROR] Current cluster size: 2 (Expected: 3)
      
       Errors found.
      
      There's something wrong with the cluster, do you want to run a restore?
      
       Are you sure you want to run a restore? Click to confirm
       Proceed or Abort
      
    2. Optional. The backup stage:

      Running the Galera database backup pipeline. For the pipeline workflow, see Create an instant backup of a MySQL database automatically.

    3. The restoration stage:

      1. If Proceed is selected, the restoration stage will continue. Otherwise, it will abort.
      2. The node that was shut down last is used as the source of truth.
    4. The verification stage:

      Verifying the status of the cluster.

  5. After the restoration is finalized, verify that all nodes are back and the cluster is working.

  6. Revert the changes made to the cluster/openstack/database/init.yml file in step 2 of Prepare for a Galera cluster restoration.
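
The pipeline parameters from step 3 can be collected and sanity-checked before you enter them in the Jenkins web UI. The following is a minimal sketch, not part of MCP. The RestoreParams class, its validate method, and the "salt" credentials value are hypothetical names introduced for illustration. Only the parameter names, the accepted RESTORE_TYPE values, and the defaults come from the table in step 3. The rule that VERIFICATION_RETRIES must not be negative is an assumption.

```python
from dataclasses import dataclass, asdict

# Values that RESTORE_TYPE accepts, as listed in step 3.
RESTORE_TYPES = ("ONLY_RESTORE", "BACKUP_AND_RESTORE")


@dataclass
class RestoreParams:
    """Hypothetical container for the pipeline parameters from step 3."""

    SALT_MASTER_URL: str
    CREDENTIALS_ID: str
    RESTORE_TYPE: str
    ASK_CONFIRMATION: bool = True   # documented default
    CHECK_TIME_SYNC: bool = True    # documented default
    VERIFICATION_RETRIES: int = 5   # documented default

    def validate(self) -> None:
        """Raise ValueError if a value contradicts the parameter table."""
        if not self.SALT_MASTER_URL.startswith(("http://", "https://")):
            raise ValueError("SALT_MASTER_URL must be a URL with the salt-api port")
        if self.RESTORE_TYPE not in RESTORE_TYPES:
            raise ValueError(f"RESTORE_TYPE must be one of {RESTORE_TYPES}")
        # Assumption: a negative retry count is never meaningful.
        if self.VERIFICATION_RETRIES < 0:
            raise ValueError("VERIFICATION_RETRIES must not be negative")


params = RestoreParams(
    SALT_MASTER_URL="http://172.18.170.27:6969",  # example address from step 3
    CREDENTIALS_ID="salt",                        # placeholder value, an assumption
    RESTORE_TYPE="BACKUP_AND_RESTORE",
    VERIFICATION_RETRIES=10,                      # raised for a larger cluster
)
params.validate()
print(asdict(params))
```

Keeping the defaults in one place makes it easier to see which values were changed on purpose for a given run, for example a higher VERIFICATION_RETRIES for a larger cluster.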
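
If you save the verification report from the pipeline console output, you can summarize it with a short script. The following is a minimal sketch, assuming the report keeps the layout shown in the example in step 4: a bracketed status tag, the check name, a colon, and the value. The parse_report function and the STATUS_LINE pattern are hypothetical helpers, not part of the pipeline. The sketch relies only on the OK and ERROR tags that appear in the example. The exact tag printed for warnings is not shown there, so it is not assumed here.

```python
import re
from collections import Counter

# Matches lines such as "[OK     ] Cluster status: Primary (Expected: Primary)"
# and "[  ERROR] Current cluster size: 2 (Expected: 3)".
STATUS_LINE = re.compile(
    r"^\s*\[\s*(?P<level>[A-Z]+)\s*\]\s*(?P<check>[^:]+):\s*(?P<rest>.*)$"
)


def parse_report(text):
    """Return (level, check, rest) tuples, one per tagged status line."""
    results = []
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:  # continuation lines carry no status tag and are skipped
            results.append(
                (match["level"], match["check"].strip(), match["rest"].strip())
            )
    return results


# Excerpt of the example report from step 4.
REPORT = """\
 [OK     ] Cluster status: Primary (Expected: Primary)
 [OK     ] Master node status: true (Expected: ON or true)
 [OK     ] Master node status comment: Synced (Expected: Joining or Waiting on SST
           or Joined or Synced or Donor)
 [OK     ] Master node connectivity: true (Expected: ON or true)

 [  ERROR] Current cluster size: 2 (Expected: 3)
"""

entries = parse_report(REPORT)
counts = Counter(level for level, _, _ in entries)
failed = [check for level, check, _ in entries if level == "ERROR"]

print(dict(counts))  # {'OK': 4, 'ERROR': 1}
print(failed)        # ['Current cluster size']
```

A non-empty failed list corresponds to the "Errors found." outcome in the example, which is the case where the pipeline asks whether to run a restore.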