This section describes how to replace a failed physical node that has a Ceph OSD or multiple OSD nodes running on it, using the Ceph - replace failed OSD Jenkins pipeline.
To replace a failed physical node with a Ceph OSD or multiple OSD nodes:
1. Log in to the Jenkins web UI.
2. Open the Ceph - replace failed OSD pipeline.
3. Specify the following parameters:

| Parameter | Description and values |
|---|---|
| SALT_MASTER_CREDENTIALS | The Salt Master credentials to use for connection, defaults to |
| SALT_MASTER_URL | The Salt Master node host URL with the |
| HOST | Add the Salt target name of the Ceph OSD node. For example, |
| OSD | Add a comma-separated list of Ceph OSDs on the specified |
| DEVICE [0] | Add a comma-separated list of failed devices to replace at |
| DATA_PARTITION [0] | (Optional) Add a comma-separated list of mounted partitions of the failed device. These partitions will be unmounted. We recommend that multiple OSD nodes per device are used. For example, |
| JOURNAL_BLOCKDB_BLOCKWAL_PARTITION [0] | Add a comma-separated list of partitions that store |
| ADMIN_HOST | Add |
| CLUSTER_FLAGS | Add a comma-separated list of flags to apply before and after the pipeline. |
| WAIT_FOR_HEALTHY | Select to perform the Ceph health check within the pipeline. |
| DMCRYPT [0] | Select if you are replacing an encrypted OSD. In such a case, also specify |

4. Click Deploy.
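The parameter set above can also be assembled programmatically before it is submitted. The following is a minimal sketch, not the pipeline's own code. It assumes that the five parameters listed in `REQUIRED` must be non-empty (the table marks only DATA_PARTITION as optional), that list-valued parameters are passed as comma-separated strings, and that the job is reachable through the standard Jenkins `buildWithParameters` endpoint. The helper names and the `job_url` argument are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: these parameters must always be filled in.
REQUIRED = ("SALT_MASTER_CREDENTIALS", "SALT_MASTER_URL", "HOST", "OSD", "ADMIN_HOST")


def build_parameters(host, osds, admin_host, salt_url, salt_credentials,
                     cluster_flags=(), wait_for_healthy=True):
    """Return the pipeline parameters as a flat dict of strings."""
    params = {
        "SALT_MASTER_CREDENTIALS": salt_credentials,
        "SALT_MASTER_URL": salt_url,
        "HOST": host,
        # Lists are joined into the comma-separated form the table asks for.
        "OSD": ",".join(str(osd) for osd in osds),
        "ADMIN_HOST": admin_host,
        "CLUSTER_FLAGS": ",".join(cluster_flags),
        "WAIT_FOR_HEALTHY": "true" if wait_for_healthy else "false",
    }
    missing = [name for name in REQUIRED if not params[name]]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return params


def build_trigger_url(job_url, params):
    """Build a buildWithParameters URL; job_url is supplied by the caller."""
    return job_url.rstrip("/") + "/buildWithParameters?" + urlencode(params)
```

Validating the values locally catches an empty HOST or OSD list before the job starts. Actually submitting the build would additionally require an authenticated POST request, which this sketch leaves out.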
The Ceph - replace failed OSD pipeline workflow:
1. Mark the Ceph OSD as out.
2. Wait until the Ceph cluster is in a healthy state if WAIT_FOR_HEALTHY was selected. In this case, Jenkins pauses the execution of the pipeline until the data migrates to a different Ceph OSD.
3. Stop the Ceph OSD service.
4. Remove the Ceph OSD from the CRUSH map.
5. Remove the Ceph OSD authentication key.
6. Remove the Ceph OSD from the Ceph cluster.
7. Unmount data partition(s) of the failed disk.
8. Delete the partition table of the failed disk.
9. Remove the partition from the block_db, block_wal, or journal.
10. Perform one of the following depending on the MCP release version:
    - For deployments prior to the MCP 2019.2.3 update, redeploy the failed Ceph OSD.
    - For deployments starting from the MCP 2019.2.3 update:
      1. Wait for the hardware replacement and confirmation to proceed.
      2. Redeploy the failed Ceph OSD on the replaced hardware.
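The pause that occurs when WAIT_FOR_HEALTHY is selected can be pictured as a polling loop around `ceph health`. The sketch below is an illustration under assumptions, not the pipeline's implementation: the function names, the polling interval, and the attempt limit are hypothetical, and the health probe is injectable so that the loop can be exercised without a cluster.

```python
import subprocess
import time


def ceph_health():
    """Return the first word printed by `ceph health`, for example HEALTH_OK."""
    result = subprocess.run(["ceph", "health"], capture_output=True,
                            text=True, check=True)
    parts = result.stdout.split()
    return parts[0] if parts else ""


def wait_for_healthy(get_health=ceph_health, interval=10.0, attempts=60,
                     sleep=time.sleep):
    """Poll until the cluster reports HEALTH_OK.

    Returns True once HEALTH_OK is seen, or False after `attempts` checks.
    """
    for attempt in range(attempts):
        if get_health() == "HEALTH_OK":
            return True
        # Do not sleep after the final unsuccessful check.
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A bounded number of attempts is a design choice of this sketch. It keeps a cluster that never rebalances from blocking the caller indefinitely.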
Note
If any of steps 1-9 has already been performed manually, Jenkins proceeds to the next step.
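For orientation, the OSD removal part of the workflow (marking the OSD out, stopping its service, and removing it from the CRUSH map, the authentication database, and the cluster) corresponds to standard Ceph and systemd commands. The sketch below only builds those commands as a dry run. It is an assumption that the pipeline issues equivalent calls, the function name is hypothetical, and the disk-level steps (unmounting and wiping partitions) are deliberately left out. Steps that were already performed manually can be skipped by number, in the spirit of the note above.

```python
def removal_commands(osd_id, already_done=()):
    """Return (step, command) pairs for workflow steps 1 and 3 to 6.

    Steps whose numbers appear in `already_done` are omitted.
    Nothing is executed; the caller decides how to run the commands.
    """
    name = f"osd.{osd_id}"
    steps = [
        (1, ["ceph", "osd", "out", name]),
        # Runs on the node that hosts the OSD daemon.
        (3, ["systemctl", "stop", f"ceph-osd@{osd_id}"]),
        (4, ["ceph", "osd", "crush", "remove", name]),
        (5, ["ceph", "auth", "del", name]),
        (6, ["ceph", "osd", "rm", name]),
    ]
    return [(step, cmd) for step, cmd in steps if step not in already_done]
```

For example, `removal_commands(3, already_done={1, 3})` returns only the CRUSH, authentication key, and cluster removal commands for `osd.3`.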