Replace a failed Ceph OSD with a metadata device as a logical volume path¶
You can apply the procedure below in the following cases:

- A Ceph OSD failed without a data or metadata device outage. In this case, first remove the failed Ceph OSD and clean up all corresponding disks and partitions. Then add a new Ceph OSD to the same data and metadata paths.
- A Ceph OSD failed with a data or metadata device outage. In this case, also first remove the failed Ceph OSD and clean up all corresponding disks and partitions. Then add a new Ceph OSD to a newly replaced data device with the same metadata path.
Note
The procedure below also applies to manually created metadata partitions.
Remove a failed Ceph OSD by ID with a defined metadata device¶
Identify the ID of the Ceph OSD related to the failed device. For example, use the Ceph CLI in the rook-ceph-tools Pod:

ceph osd metadata
Example of system response:
{ "id": 0, ... "bluestore_bdev_devices": "vdc", ... "devices": "vdc", ... "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf", ... "pod_name": "rook-ceph-osd-0-7b8d4d58db-f6czn", ... }, { "id": 1, ... "bluefs_db_devices": "vdf", ... "bluestore_bdev_devices": "vde", ... "devices": "vde,vdf", ... "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf", ... "pod_name": "rook-ceph-osd-1-78fbc47dc5-px9n2", ... }, ...
Open the KaaSCephCluster custom resource (CR) for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section:

- Find and capture the metadataDevice path to reuse it during re-creation of the Ceph OSD. You can also record it with the read-only query shown after the example snippet below.
- Remove the required device.
Example configuration snippet:
spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName> # remove the entire item from the storageDevices list
          # fullPath: <deviceByPath> if the device is specified using by-path instead of name
          config:
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1
In the example above, <machineName> is the name of the machine that relates to the node on which the device <deviceName> or <deviceByPath> must be replaced.
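If you want to record the metadataDevice path without opening an editor, a read-only query such as the one below may help. The jsonpath expression is only a sketch and assumes a single KaaSCephCluster object in the namespace:

# Dump the nodes section to capture the metadataDevice path before removing the device
kubectl -n <managedClusterProjectName> get kaascephcluster -o jsonpath='{.items[0].spec.cephClusterSpec.nodes}'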
Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<osdID>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-<machineName>-<osdID>
  namespace: <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        cleanupByOsdId:
        - <osdID>
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <managedClusterProjectName>
Substitute the following parameters:

- <machineName> with the machine name from the previous step
- <managedClusterProjectName> with the project name of the related managed cluster
- <osdID> with the ID of the affected Ceph OSD
- <kaasCephClusterName> with the KaaSCephCluster resource name
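For illustration only, a filled-in request saved to a file might look as follows. The machine name worker-1, OSD ID 1, project name child-project, and cluster name kaas-ceph-cluster are hypothetical placeholders:

# Sketch: write a filled-in request file (all names below are hypothetical)
cat > replace-failed-osd-worker-1-1-request.yaml <<'EOF'
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-worker-1-1
  namespace: child-project
spec:
  osdRemove:
    nodes:
      worker-1:
        cleanupByOsdId:
        - 1
  kaasCephCluster:
    name: kaas-ceph-cluster
    namespace: child-project
EOF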
Apply the template to the cluster:
kubectl apply -f replace-failed-osd-<machineName>-<osdID>-request.yaml
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
Verify that the status section of the KaaSCephOperationRequest CR contains the removeInfo section:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o yaml
Example of system response:
childNodesMapping:
  <nodeName>: <machineName>
removeInfo:
  cleanUpMap:
    <nodeName>:
      osdMapping:
        "<osdID>":
          deviceMapping:
            <dataDevice>:
              deviceClass: hdd
              devicePath: <dataDeviceByPath>
              devicePurpose: block
              usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
              zapDisk: true
            <metadataDevice>:
              deviceClass: hdd
              devicePath: <metadataDeviceByPath>
              devicePurpose: db
              usedPartition: /dev/bluedb/meta_1
          uuid: ef516477-d2da-492f-8169-a3ebfc3417e2
Definition of values in angle brackets:

- <machineName> - name of the machine on which the device is being replaced, for example, worker-1
- <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
- <osdID> - Ceph OSD ID for the device being replaced, for example, 1
- <dataDevice> - name of the data device placed on the node, for example, /dev/vde
- <dataDeviceByPath> - by-path of the data device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
- <metadataDevice> - name of the metadata device placed on the node, for example, /dev/vdf
- <metadataDeviceByPath> - by-path of the metadata device placed on the node, for example, /dev/disk/by-path/pci-0000:00:12.0
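Optionally, you can cross-check the reported devices and partitions directly on the affected node. The snippet below is only a sketch; the device names /dev/vde and /dev/vdf come from the examples above:

# List the example data and metadata devices with their partitions
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/vde /dev/vdf
# Show logical volumes and the physical devices backing them
sudo lvs -o lv_name,vg_name,devices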
Note

The partitions that are manually created or configured using the BareMetalHostProfile object can be removed only manually, during a complete metadata disk removal, or during the Machine object removal or re-provisioning.

Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o yaml
Example of system response:
status:
  phase: ApproveWaiting
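To avoid re-running the command manually, you can either print only the phase or watch the request until it changes. Both commands below are sketches:

# Print only the current phase of the request
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o jsonpath='{.status.phase}{"\n"}'
# Watch the request until the phase changes to ApproveWaiting
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -w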
In the KaaSCephOperationRequest CR, set the approve flag to true:

kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<osdID>

Configuration snippet:

spec:
  osdRemove:
    approve: true
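If you prefer a non-interactive approval, a merge patch such as the following should have the same effect; this is a sketch rather than the documented workflow:

kubectl -n <managedClusterProjectName> patch kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> --type merge -p '{"spec":{"osdRemove":{"approve":true}}}'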
Review the following status fields of the KaaSCephOperationRequest CR to monitor request processing:

- status.phase - current state of request processing
- status.messages - description of the current phase
- status.conditions - full history of request processing before the current phase
- status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any
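To extract only the issues and warnings, a jsonpath query such as the following can be used; this is a sketch:

# Print any issues and warnings collected during request processing
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o jsonpath='{.status.removeInfo.issues}{"\n"}{.status.removeInfo.warnings}{"\n"}'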
Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
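As an optional extra check, you can confirm that the removed Ceph OSD no longer appears in the cluster topology:

# The ID of the removed OSD should no longer be listed
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree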
Re-create a Ceph OSD with the same metadata partition¶
Open the KaaSCephCluster CR for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, add the replaced device with the same metadataDevice path as on the removed Ceph OSD. For example:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceByID> # Recommended. Add a new device by ID, for example, /dev/disk/by-id/...
          #fullPath: <deviceByPath> # Add a new device by path, for example, /dev/disk/by-path/...
          config:
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1 # Must match the value of the previously removed OSD
Substitute <machineName> with the machine name of the node where the new device <deviceByID> or <deviceByPath> must be added.

Wait for the replaced disk to apply to the Ceph cluster as a new Ceph OSD.
You can monitor the application state using either the status section of the KaaSCephCluster CR or the rook-ceph-tools Pod:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
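Once the new Ceph OSD appears, you can optionally confirm that it uses the intended metadata device; substitute <newOsdID> with the ID of the newly created Ceph OSD:

# bluefs_db_devices in the output should point at the expected metadata device
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd metadata <newOsdID>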