Replace a failed Ceph OSD
After a physical disk replacement, you can use the Ceph LCM API to redeploy a failed Ceph OSD. The common flow of replacing a failed Ceph OSD is as follows:
Remove the obsolete Ceph OSD from the Ceph cluster by device name, by Ceph OSD ID, or by path.
Add a new Ceph OSD on the new disk to the Ceph cluster.
Note
Ceph OSD replacement requires a KaaSCephOperationRequest CR. For the workflow overview and a description of the spec and phases, see High-level workflow of Ceph OSD or node removal.
Remove a failed Ceph OSD by device name, path, or ID
Warning
The procedure below presupposes that the Operator knows the exact device name, by-path, or by-id of the replaced device, as well as the node on which the replacement occurred.
Warning
Since Container Cloud 2.23.0 and 2.23.1 for MOSK 23.1, Ceph OSD removal using by-path, by-id, or device name is not supported if the device was physically removed from the node. In that case, use cleanupByOsdId instead. For details, see Remove a failed Ceph OSD by Ceph OSD ID.
Open the KaaSCephCluster CR of a managed cluster for editing:
kubectl edit kaascephcluster -n <managedClusterProjectName>
Substitute <managedClusterProjectName> with the corresponding value.
In the nodes section, remove the required device:
spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName>  # remove the entire item from the storageDevices list
          # fullPath: <deviceByPath> if the device is specified with by-path instead of name
          config:
            deviceClass: hdd
Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be replaced.
Save KaaSCephCluster and close the editor.
Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<deviceName>-request.yaml:
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-<machineName>-<deviceName>
  namespace: <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        cleanupByDevice:
        - name: <deviceName>
          # If the device is specified with by-path or by-id (since Container
          # Cloud 2.19.0 and 2.20.1 for MOSK 22.4) instead of name,
          # use path: <deviceByPath> or <deviceById>.
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <managedClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.
Apply the template to the cluster:
kubectl apply -f replace-failed-osd-<machineName>-<deviceName>-request.yaml
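The request template above can also be generated rather than written by hand. The following is a minimal sketch, not part of the product tooling: the function name `build_device_removal_request` and the sample values `my-project` and `ceph-cluster` are hypothetical, and only the plain device name form of cleanupByDevice is covered.

```python
import json


def build_device_removal_request(machine, device_name, project, cluster):
    """Return a KaaSCephOperationRequest manifest that removes one device by name."""
    return {
        "apiVersion": "kaas.mirantis.com/v1alpha1",
        "kind": "KaaSCephOperationRequest",
        "metadata": {
            # Same naming pattern as the template in the procedure.
            "name": f"replace-failed-osd-{machine}-{device_name}",
            "namespace": project,
        },
        "spec": {
            "osdRemove": {
                "nodes": {machine: {"cleanupByDevice": [{"name": device_name}]}}
            },
            "kaasCephCluster": {"name": cluster, "namespace": project},
        },
    }


if __name__ == "__main__":
    # "my-project" and "ceph-cluster" are placeholder values for illustration.
    request = build_device_removal_request("worker-1", "sdb", "my-project", "ceph-cluster")
    print(json.dumps(request, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be saved to a file and passed to kubectl apply -f in the same way as the YAML template.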
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
Example of system response:
status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <deviceName>:
                  path: <deviceByPath>
                  partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                  type: "block"
                  class: "hdd"
                  zapDisk: true
If needed, change the following values:
<machineName> - machine name where the replacement occurs, for example, worker-1.
<nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.
<osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.
<deviceName> - actual device name placed on the node, for example, sdb.
<deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
Example of system response:
status:
  phase: ApproveWaiting
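The two checks in this step (the cleanUpMap matches the intended removal, and the phase is ApproveWaiting) can be expressed as a small predicate. This is an illustrative sketch only: `ready_to_approve` is a hypothetical helper, and it assumes the status has been parsed into a dictionary (for example, from the output of the same command with -o json) and follows the layout of the example response in this procedure.

```python
def ready_to_approve(status, node, osd_id, device):
    """Return True if the request awaits approval and targets only the expected device."""
    if status.get("phase") != "ApproveWaiting":
        return False
    clean_up_map = (
        status.get("osdRemoveStatus", {}).get("removeInfo", {}).get("cleanUpMap", {})
    )
    # Exactly one node must be scheduled for cleanup: the expected one.
    if set(clean_up_map) != {node}:
        return False
    osd_mapping = clean_up_map[node].get("osdMapping", {})
    # Keys may be strings (JSON) or integers (YAML), so compare them as strings.
    if {str(key) for key in osd_mapping} != {str(osd_id)}:
        return False
    entry = next(iter(osd_mapping.values()))
    return device in entry.get("deviceMapping", {})
```

A False result means either that the request has not reached ApproveWaiting yet or that the cleanUpMap lists something other than the intended node, Ceph OSD, and device. In the latter case, inspect the request before approving it.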
Edit the KaaSCephOperationRequest CR and set the approve flag to true:
kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName>
For example:
spec:
  osdRemove:
    approve: true
Review the status of the KaaSCephOperationRequest resource request processing. The key parameters are as follows:
status.phase - the current state of request processing
status.messages - the description of the current phase
status.conditions - the full history of request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing
Verify that the KaaSCephOperationRequest has been completed. For example:
status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
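Waiting for completion can be scripted by polling the request until it reaches one of the two final phases named above. The sketch below is an assumption-laden illustration: the helper names, the 30-minute timeout, and the 15-second polling interval are arbitrary choices, and it requires kubectl to be configured for the cluster that holds the request.

```python
import json
import subprocess
import time

# Final phases named in this procedure.
FINISHED_PHASES = {"Completed", "CompletedWithWarnings"}


def is_finished(status):
    """Return True if the request status reports one of the final phases."""
    return status.get("phase") in FINISHED_PHASES


def wait_for_request(project, name, timeout=1800, interval=15):
    """Poll the request with kubectl until it finishes or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["kubectl", "-n", project, "get", "kaascephoperationrequest", name, "-o", "json"],
            check=True, capture_output=True, text=True,
        )
        status = json.loads(result.stdout).get("status", {})
        if is_finished(status):
            return status["phase"]
        time.sleep(interval)
    raise TimeoutError(f"{name} did not finish within {timeout} seconds")
```

If the function returns CompletedWithWarnings, review the warning messages in the request status before continuing.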
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Remove a failed Ceph OSD by Ceph OSD ID
Warning
The procedure below presupposes that the Operator knows only the failed Ceph OSD ID.
Obtain the osd-device mapping from the status section of the KaaSCephCluster CR:
kubectl get kaascephcluster -n <managedClusterProjectName> -o yaml
Substitute <managedClusterProjectName> with the corresponding value.
For example:
status:
  fullClusterInfo:
    cephDetails:
      cephDeviceMapping:
        <nodeName>:
          <osdId>: <deviceName>
<nodeName> - the corresponding node name that hosts the Ceph OSD
<osdId> - the ID of the Ceph OSD to replace
<deviceName> - the actual device name to replace
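On clusters with many nodes, locating one Ceph OSD ID in the mapping by eye is error-prone. The lookup can be sketched as follows, assuming the cephDeviceMapping section has already been parsed into a nested dictionary. `find_osd` is a hypothetical helper name.

```python
def find_osd(device_mapping, osd_id):
    """Return (node_name, device_name) for the given Ceph OSD ID, or None if absent."""
    for node, osds in device_mapping.items():
        for key, device in osds.items():
            # Keys may be strings (JSON) or integers (YAML), so compare them as strings.
            if str(key) == str(osd_id):
                return node, device
    return None


if __name__ == "__main__":
    # Sample data shaped like the cephDeviceMapping example; the values are illustrative.
    mapping = {"kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af": {"1": "sdb"}}
    print(find_osd(mapping, 1))
```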
Obtain <machineName> for <nodeName> where the Ceph OSD is placed:
kubectl -n rook-ceph get node -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.kaas\.mirantis\.com\/machine-name}{"\n"}{end}'
Open the KaaSCephCluster CR of a managed cluster for editing:
kubectl edit kaascephcluster -n <managedClusterProjectName>
Substitute <managedClusterProjectName> with the corresponding value.
In the nodes section, remove the required device:
spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName>  # remove the entire item from the storageDevices list
          config:
            deviceClass: hdd
Substitute <machineName> with the machine name of the node where the device <deviceName> is going to be replaced.
Save KaaSCephCluster and close the editor.
Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-osd-<osdId>-request.yaml:
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-<machineName>-osd-<osdId>
  namespace: <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        cleanupByOsdId:
        - <osdId>
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <managedClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.
Apply the template to the cluster:
kubectl apply -f replace-failed-<machineName>-osd-<osdId>-request.yaml
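The cleanupByOsdId variant of the request can be generated in the same spirit. This is a hypothetical sketch: the function name and sample values are illustrative, and it assumes that the entries of cleanupByOsdId are plain integers, as the template above suggests. Verify this against the API reference for your product version.

```python
import json


def build_osd_id_removal_request(machine, osd_id, project, cluster):
    """Return a KaaSCephOperationRequest manifest that removes one Ceph OSD by ID."""
    return {
        "apiVersion": "kaas.mirantis.com/v1alpha1",
        "kind": "KaaSCephOperationRequest",
        "metadata": {
            # Same naming pattern as the template in the procedure.
            "name": f"replace-failed-{machine}-osd-{osd_id}",
            "namespace": project,
        },
        "spec": {
            # Assumption: OSD IDs are listed as integers.
            "osdRemove": {"nodes": {machine: {"cleanupByOsdId": [int(osd_id)]}}},
            "kaasCephCluster": {"name": cluster, "namespace": project},
        },
    }


if __name__ == "__main__":
    # "my-project" and "ceph-cluster" are placeholder values for illustration.
    request = build_osd_id_removal_request("worker-1", 1, "my-project", "ceph-cluster")
    print(json.dumps(request, indent=2))
```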
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
Example of system response:
status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <deviceName>:
                  path: <deviceByPath>
                  partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                  type: "block"
                  class: "hdd"
                  zapDisk: true
If needed, change the following values:
<machineName> - machine name where the replacement occurs, for example, worker-1.
<nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.
<osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.
<deviceName> - actual device name placed on the node, for example, sdb.
<deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
Example of system response:
status:
  phase: ApproveWaiting
Edit the KaaSCephOperationRequest CR and set the approve flag to true:
kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-osd-<osdId>
For example:
spec:
  osdRemove:
    approve: true
Review the status of the KaaSCephOperationRequest resource request processing. The key parameters are as follows:
status.phase - the current state of request processing
status.messages - the description of the current phase
status.conditions - the full history of request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing
Verify that the KaaSCephOperationRequest has been completed. For example:
status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Deploy a new device after removal of a failed one
Open the KaaSCephCluster CR of a managed cluster for editing:
kubectl edit kaascephcluster -n <managedClusterProjectName>
Substitute <managedClusterProjectName> with the corresponding value.
In the nodes section, add a new device:
spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName>
          # fullPath: <deviceByPath> # if the device is supposed to be added with by-path
          config:
            deviceClass: hdd
Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be added as a Ceph OSD.
Verify that the new Ceph OSD has appeared in the Ceph cluster and is in and up. The fullClusterInfo section should not contain any issues.
kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
For example:
status:
  fullClusterInfo:
    daemonStatus:
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok
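The in and up check can be automated by parsing the running string. The sketch below assumes the string always follows the format of the single example above ('<running>/<total> running: <up> up, <in> in'). That format is inferred from this example only, and `all_osds_up_and_in` is a hypothetical helper.

```python
import re

# Pattern inferred from the example value '3/3 running: 3 up, 3 in'.
RUNNING_PATTERN = re.compile(r"(\d+)/(\d+) running: (\d+) up, (\d+) in")


def all_osds_up_and_in(running):
    """Return True if every Ceph OSD is reported as running, up, and in."""
    match = RUNNING_PATTERN.fullmatch(running.strip())
    if match is None:
        raise ValueError(f"unexpected format: {running!r}")
    running_count, total, up_count, in_count = map(int, match.groups())
    return running_count == total == up_count == in_count
```

A False result after adding the device indicates that the new Ceph OSD has not yet joined the cluster or is not healthy, so the fullClusterInfo section deserves a closer look.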