Replace a failed Ceph OSD

This section describes how to replace a failed physical device with one or multiple Ceph OSDs running on it using the Ceph - replace failed OSD Jenkins pipeline.

To replace a failed device with one or multiple Ceph OSDs:

  1. Log in to the Jenkins web UI.

  2. Open the Ceph - replace failed OSD pipeline.

  3. Specify the following parameters:

    SALT_MASTER_CREDENTIALS
      The Salt Master credentials to use for the connection. Defaults to salt.

    SALT_MASTER_URL
      The Salt Master node host URL with the salt-api port. Defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST
      The Salt target name of the Ceph OSD node. For example, osd005*.

    OSD
      A comma-separated list of Ceph OSDs on the specified HOST node. For example, 1,2.

    DEVICE [0]
      A comma-separated list of failed devices to replace at HOST. For example, /dev/sdb,/dev/sdc.

    DATA_PARTITION [0]
      (Optional) A comma-separated list of mounted partitions of the failed device. These partitions will be unmounted. Recommended when multiple OSDs are located on one device. For example, /dev/sdb1,/dev/sdb3.

    JOURNAL_BLOCKDB_BLOCKWAL_PARTITION [0]
      A comma-separated list of partitions that store the journal, block_db, or block_wal of the failed devices on the specified HOST. For example, /dev/sdh2,/dev/sdh3.

    ADMIN_HOST
      The Ceph cluster node with the admin keyring. Add cmn01*.

    CLUSTER_FLAGS
      A comma-separated list of flags to apply before and after the pipeline run.

    WAIT_FOR_HEALTHY
      Select to perform the Ceph health check within the pipeline.

    DMCRYPT [0]
      Select if you are replacing an encrypted OSD. In this case, also specify noout,norebalance in CLUSTER_FLAGS.

  4. Click Deploy.
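Before filling in the parameters, you can identify the failed OSD IDs and their host from any node that holds the admin keyring. The following is a minimal sketch; the OSD ID used in it (1) is an illustrative value, and the commands require a reachable Ceph cluster:

```shell
# List OSDs that are currently down; the reported IDs go into the OSD parameter.
list_down_osds() {
    ceph osd tree down
}

# Show which node a given OSD runs on, to fill in the HOST parameter.
# The OSD ID (1) is illustrative; substitute the ID reported as down.
locate_osd() {
    ceph osd find 1
}
```

The functions are only definitions; invoke them on a node with the admin keyring, such as the one you pass as ADMIN_HOST.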

The Ceph - replace failed OSD pipeline workflow:

  1. Mark the Ceph OSD as out.

  2. Wait until the Ceph cluster is in a healthy state if WAIT_FOR_HEALTHY was selected. In this case, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.

  3. Stop the Ceph OSD service.

  4. Remove the Ceph OSD from the CRUSH map.

  5. Remove the Ceph OSD authentication key.

  6. Remove the Ceph OSD from the Ceph cluster.

  7. Unmount data partition(s) of the failed disk.

  8. Delete the partition table of the failed disk.

  9. Remove the partition from the block_db, block_wal, or journal.

  10. Perform one of the following depending on the MCP release version:

    • For deployments prior to the MCP 2019.2.3 update, redeploy the failed Ceph OSD.

    • For deployments starting from the MCP 2019.2.3 update:

      1. Wait for the hardware replacement and confirmation to proceed.

      2. Redeploy the failed Ceph OSD on the replaced hardware.
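For reference, steps 1-9 of the workflow above roughly correspond to the following manual commands. This is a sketch, not the pipeline's exact implementation: the OSD ID (1), device (/dev/sdb), and partition paths are illustrative, and the commands must run on a node with the admin keyring. The body is wrapped in a function so nothing executes until invoked deliberately:

```shell
# Manual equivalent of the pipeline's removal steps (illustrative values).
remove_failed_osd() {
    osd_id=1           # illustrative OSD ID from the OSD parameter
    dev=/dev/sdb       # illustrative failed device from the DEVICE parameter

    ceph osd out "${osd_id}"                 # 1. mark the Ceph OSD as out
    ceph -s                                  # 2. check cluster health before proceeding
    systemctl stop "ceph-osd@${osd_id}"      # 3. stop the Ceph OSD service
    ceph osd crush remove "osd.${osd_id}"    # 4. remove the OSD from the CRUSH map
    ceph auth del "osd.${osd_id}"            # 5. remove the OSD authentication key
    ceph osd rm "${osd_id}"                  # 6. remove the OSD from the cluster
    umount "${dev}1"                         # 7. unmount data partition(s)
    sgdisk --zap-all "${dev}"                # 8. delete the partition table
    # 9. removing the journal/block_db/block_wal partition depends on the
    #    layout, for example: sgdisk -d <partition-number> /dev/sdh
}
```

If you applied flags such as noout,norebalance through CLUSTER_FLAGS, their manual equivalents are ceph osd set <flag> before the procedure and ceph osd unset <flag> after it.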

Note

If any of steps 1-9 have already been performed manually, Jenkins proceeds to the next step.

[0] The parameter has been removed starting from the MCP 2019.2.3 update.