Replace a failed Ceph OSD

This section describes how to replace a failed physical device with one or multiple Ceph OSDs running on it using the Ceph - replace failed OSD Jenkins pipeline.

To replace a failed device with one or multiple Ceph OSDs:

  1. Log in to the Jenkins web UI.

  2. Open the Ceph - replace failed OSD pipeline.

  3. Specify the following parameters:

    SALT_MASTER_CREDENTIALS
      The Salt Master credentials to use for the connection. Defaults to salt.

    SALT_MASTER_URL
      The Salt Master node host URL with the salt-api port. Defaults to the jenkins_salt_api_url parameter. For example, http://172.18.170.27:6969.

    HOST
      The Salt target name of the Ceph OSD node. For example, osd005*.

    OSD
      A comma-separated list of Ceph OSDs on the specified HOST node. For example, 1,2.

    DEVICE [0]
      A comma-separated list of failed devices to replace at HOST. For example, /dev/sdb,/dev/sdc.

    DATA_PARTITION [0]
      (Optional) A comma-separated list of mounted partitions of the failed device. These partitions will be unmounted. Recommended when multiple OSDs are located on one device. For example, /dev/sdb1,/dev/sdb3.

    JOURNAL_BLOCKDB_BLOCKWAL_PARTITION [0]
      A comma-separated list of partitions that store the journal, block_db, or block_wal of the failed devices on the specified HOST. For example, /dev/sdh2,/dev/sdh3.

    ADMIN_HOST
      The Ceph cluster node with the admin keyring. Add cmn01*.

    CLUSTER_FLAGS
      A comma-separated list of flags to apply before and after the pipeline run.

    WAIT_FOR_HEALTHY
      Select to perform the Ceph health check within the pipeline.

    DMCRYPT [0]
      Select if you are replacing an encrypted OSD. In this case, also specify noout,norebalance in CLUSTER_FLAGS.

  4. Click Deploy.
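Before filling in the parameters, you can identify the failed OSD IDs and their host from any node that holds the admin keyring. The following is a minimal sketch; the OSD ID used in it (1) is an illustrative value, and the commands require a reachable Ceph cluster:

```shell
# List OSDs that are currently down; the reported IDs go into the OSD parameter.
list_down_osds() {
    ceph osd tree down
}

# Show which node a given OSD runs on, to fill in the HOST parameter.
# The OSD ID (1) is illustrative; substitute the ID reported as down.
locate_osd() {
    ceph osd find 1
}
```

The functions are only definitions; invoke them on a node with the admin keyring, such as the one you pass as ADMIN_HOST.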

The Ceph - replace failed OSD pipeline workflow:

  1. Mark the Ceph OSD as out.

  2. Wait until the Ceph cluster is in a healthy state if WAIT_FOR_HEALTHY was selected. In this case, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.

  3. Stop the Ceph OSD service.

  4. Remove the Ceph OSD from the CRUSH map.

  5. Remove the Ceph OSD authentication key.

  6. Remove the Ceph OSD from the Ceph cluster.

  7. Unmount data partition(s) of the failed disk.

  8. Delete the partition table of the failed disk.

  9. Remove the partition from the block_db, block_wal, or journal.

  10. Perform one of the following depending on the MCP release version:

    • For deployments prior to the MCP 2019.2.3 update, redeploy the failed Ceph OSD.

    • For deployments starting from the MCP 2019.2.3 update:

      1. Wait for the hardware replacement and confirmation to proceed.

      2. Redeploy the failed Ceph OSD on the replaced hardware.
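For reference, steps 1-9 of the workflow above roughly correspond to the following manual commands. This is a sketch, not the pipeline's exact implementation: the OSD ID (1), device (/dev/sdb), and partition paths are illustrative, and the commands must run on a node with the admin keyring. The body is wrapped in a function so nothing executes until invoked deliberately:

```shell
# Manual equivalent of the pipeline's removal steps (illustrative values).
remove_failed_osd() {
    osd_id=1           # illustrative OSD ID from the OSD parameter
    dev=/dev/sdb       # illustrative failed device from the DEVICE parameter

    ceph osd out "${osd_id}"                 # 1. mark the Ceph OSD as out
    ceph -s                                  # 2. check cluster health before proceeding
    systemctl stop "ceph-osd@${osd_id}"      # 3. stop the Ceph OSD service
    ceph osd crush remove "osd.${osd_id}"    # 4. remove the OSD from the CRUSH map
    ceph auth del "osd.${osd_id}"            # 5. remove the OSD authentication key
    ceph osd rm "${osd_id}"                  # 6. remove the OSD from the cluster
    umount "${dev}1"                         # 7. unmount data partition(s)
    sgdisk --zap-all "${dev}"                # 8. delete the partition table
    # 9. removing the journal/block_db/block_wal partition depends on the
    #    layout, for example: sgdisk -d <partition-number> /dev/sdh
}
```

If you applied flags such as noout,norebalance through CLUSTER_FLAGS, their manual equivalents are ceph osd set <flag> before the procedure and ceph osd unset <flag> after it.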

Note

If any of steps 1-9 have already been performed manually, Jenkins proceeds to the next step.

[0] The parameter has been removed starting from the MCP 2019.2.3 update.