Replace a failed Ceph OSD
Warning

This procedure is valid for MOSK clusters that use the MiraCeph custom resource (CR), which is available since MOSK 25.2 to replace the deprecated KaaSCephCluster CR. For the equivalent procedure with the KaaSCephCluster CR, refer to the following section:
After a physical disk replacement, you can use the Ceph LCM API to redeploy a failed Ceph OSD. The common flow of replacing a failed Ceph OSD is as follows:
Remove the obsolete Ceph OSD from the Ceph cluster by device name, by Ceph OSD ID, or by path.
Add a new Ceph OSD on the new disk to the Ceph cluster.
Note

Ceph OSD replacement presupposes usage of a CephOsdRemoveRequest CR. For a workflow overview and the description of the spec and phases, see High-level workflow of Ceph OSD or node removal.
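Before starting either procedure below, it can help to confirm which Ceph OSD is actually down. A minimal check, assuming the rook-ceph-tools deployment that is also used later in this guide:

# List the OSD tree and note the OSDs reported as down
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# Optionally, show only the OSDs that are currently down
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree down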
Remove a failed Ceph OSD by device name, path, or ID
Warning

The procedure below presupposes that the operator knows the exact device name, by-path, or by-id of the replaced device, as well as the node on which the replacement occurred.
Warning

A Ceph OSD removal using by-path, by-id, or device name is not supported if the device was physically removed from the node. Therefore, use cleanupByOsdId instead. For details, see Remove a failed Ceph OSD by Ceph OSD ID.
Warning

Since MOSK 23.3, Mirantis does not recommend setting the device name or the device by-path symlink in the cleanupByDevice field because these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field or use cleanupByOsdId instead. For details, see Addressing storage devices (since MOSK 25.2).
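One way to look up the non-wwn by-id symlink of a device on the affected node, assuming shell access to that node (the device name /dev/sdb is only an example):

# List persistent identifiers and find the entries pointing at the device
ls -l /dev/disk/by-id/ | grep sdb
# Alternatively, print all symlinks known to udev for the device
udevadm info --query=symlink --name=/dev/sdb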
Open the MiraCeph CR of a MOSK cluster for editing:

kubectl -n ceph-lcm-mirantis edit miraceph
In the nodes section, remove the required device from devices. When using device filters, update the deviceFilter or devicePathFilter regular expression accordingly. For example:
spec:
  nodes:
  - name: <nodeName>
    devices:
    - name: <deviceName> # remove the entire item from the devices list
      # fullPath: <deviceByPath> if the device is specified with a symlink instead of the name
      config:
        deviceClass: hdd
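If the node relies on deviceFilter or devicePathFilter instead of an explicit devices list, adjust the regular expression so that it no longer matches the replaced device. A minimal sketch, assuming the filter sits at the node level as in the example above; the regular expressions and device names are purely illustrative, so verify the field placement against your MiraCeph schema:

spec:
  nodes:
  - name: <nodeName>
    # Before: the filter also matched the failed device vdc
    # deviceFilter: "^vd[cde]$"
    # After: the filter no longer matches the replaced device
    deviceFilter: "^vd[de]$"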
Substitute <nodeName> with the node name where the device <deviceName> or <deviceByPath> is going to be replaced.

Save MiraCeph and close the editor.

Create a CephOsdRemoveRequest CR template and save it as replace-failed-osd-<nodeName>-<deviceName>-request.yaml:

apiVersion: lcm.mirantis.com/v1alpha1
kind: CephOsdRemoveRequest
metadata:
  name: replace-failed-osd-<nodeName>-<deviceName>
  namespace: ceph-lcm-mirantis
spec:
  nodes:
    <nodeName>:
      cleanupByDevice:
      - name: <deviceName>
        # If the device is specified with by-path or by-id instead of
        # the name, use path: <deviceByPath> or path: <deviceById>.
Apply the template to the cluster:
kubectl apply -f replace-failed-osd-<nodeName>-<deviceName>-request.yaml
Verify that the corresponding request has been created:
kubectl -n ceph-lcm-mirantis get cephosdremoverequest
Verify that the removeInfo section appeared in the CephOsdRemoveRequest CR status:

kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-osd-<nodeName>-<deviceName> -o yaml

Example of system response:

status:
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
Definition of values in angle brackets:

<machineName> - name of the machine on which the device is being replaced, for example, worker-1
<nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
<osdId> - Ceph OSD ID for the device being replaced, for example, 1
<dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
<dataDevice> - name of the device placed on the node, for example, /dev/sdb
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-osd-<nodeName>-<deviceName> -o yaml

Example of system response:

status:
  phase: ApproveWaiting
Edit the CephOsdRemoveRequest CR and set the approve flag to true:

kubectl -n ceph-lcm-mirantis edit cephosdremoverequest replace-failed-osd-<nodeName>-<deviceName>

For example:

spec:
  approve: true
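If you prefer not to open an interactive editor, the same flag can be set non-interactively with kubectl patch. This is an equivalent alternative, not an additional step of the procedure:

kubectl -n ceph-lcm-mirantis patch cephosdremoverequest replace-failed-osd-<nodeName>-<deviceName> \
  --type merge -p '{"spec":{"approve":true}}'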
Review the following status fields of the Ceph LCM CR request processing:

status.phase - current state of the request processing
status.messages - description of the current phase
status.conditions - full history of the request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during the request processing, if any
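To track the request phase without reading the whole resource, you can query the phase field directly; a small sketch using a standard kubectl JSONPath query:

# Print only the current phase of the removal request
kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-osd-<nodeName>-<deviceName> \
  -o jsonpath='{.status.phase}{"\n"}'
# Re-run the command after approval to watch the request progress to Completed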
Verify that the CephOsdRemoveRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
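To confirm that the cleanup jobs are gone, you can list them by the same label; an empty result is expected once the deletion has finished:

kubectl -n ceph-lcm-mirantis get jobs -l app=miraceph-cleanup-disks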
Remove a failed Ceph OSD by Ceph OSD ID
Caution
The procedure below presupposes that the operator knows only the failed Ceph OSD ID.
Identify the node and device names used by the affected Ceph OSD. Using the Ceph CLI in the rook-ceph-tools Pod, run:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd metadata <osdId>

Substitute <osdId> with the affected OSD ID.

Example output:

{
    "id": 1,
    ...
    "bluefs_db_devices": "vdc",
    ...
    "bluestore_bdev_devices": "vde",
    ...
    "devices": "vdc,vde",
    ...
    "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
    ...
},
In the example above, hostname is the node name and devices are all devices used by the affected Ceph OSD.

Open the MiraCeph CR on a MOSK cluster for editing:

kubectl -n ceph-lcm-mirantis edit miraceph

In the nodes section, remove the required device:

spec:
  nodes:
  - name: <nodeName>
    devices:
    - name: <deviceName> # remove the entire item from the devices list
      config:
        deviceClass: hdd
Substitute <nodeName> with the node name where the device <deviceName> is going to be replaced.

Save MiraCeph and close the editor.

Create a CephOsdRemoveRequest CR template and save it as replace-failed-<nodeName>-osd-<osdId>-request.yaml:

apiVersion: lcm.mirantis.com/v1alpha1
kind: CephOsdRemoveRequest
metadata:
  name: replace-failed-<nodeName>-osd-<osdId>
  namespace: ceph-lcm-mirantis
spec:
  nodes:
    <nodeName>:
      cleanupByOsdId:
      - <osdId>
Apply the template to the cluster:
kubectl apply -f replace-failed-<nodeName>-osd-<osdId>-request.yaml
Verify that the corresponding request has been created:
kubectl -n ceph-lcm-mirantis get cephosdremoverequest
Verify that the removeInfo section appeared in the CephOsdRemoveRequest CR status:

kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-<nodeName>-osd-<osdId> -o yaml

Example of system response:

status:
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
Definition of values in angle brackets:

<machineName> - name of the machine on which the device is being replaced, for example, worker-1
<nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
<osdId> - Ceph OSD ID for the device being replaced, for example, 1
<dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
<dataDevice> - name of the device placed on the node, for example, /dev/sdb
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-<nodeName>-osd-<osdId> -o yaml

Example of system response:

status:
  phase: ApproveWaiting
Edit the CephOsdRemoveRequest CR and set the approve flag to true:

kubectl -n ceph-lcm-mirantis edit cephosdremoverequest replace-failed-<nodeName>-osd-<osdId>

For example:

spec:
  approve: true
Review the following status fields of the Ceph LCM CR request processing:

status.phase - current state of the request processing
status.messages - description of the current phase
status.conditions - full history of the request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during the request processing, if any
Verify that the CephOsdRemoveRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Deploy a new device after removal of a failed one
Note

You can spawn a Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was previously in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.
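One way to double-check that a device is raw before adding it, assuming shell access to the node (the device name /dev/sdb is only an example); an empty FSTYPE column, no child partitions, and no detected signatures indicate a clean device:

# Show filesystem signatures and partitions on the device
lsblk -f /dev/sdb
# List any remaining partition-table or filesystem signatures
wipefs /dev/sdb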
If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:
You can add a raw device to a node during node deployment.
If a node supports adding devices without node reboot, you can hot plug a raw device to a node.
If a node does not support adding devices without node reboot, you can hot plug a raw device during node shutdown. In this case, complete the following steps:
Enable maintenance mode on the managed cluster.
Turn off the required node.
Attach the required raw device to the node.
Turn on the required node.
Disable maintenance mode on the managed cluster.
Open the MiraCeph CR on a MOSK cluster for editing:

kubectl -n ceph-lcm-mirantis edit miraceph

In the nodes section, add a new device:

spec:
  nodes:
  - name: <nodeName>
    devices:
    - fullPath: <deviceByID> # Recommended. Non-wwn by-id symlink.
      # name: <deviceByID> # Not recommended. If a device is supposed to be added with by-id.
      # fullPath: <deviceByPath> # Not recommended. If a device is supposed to be added with by-path.
      config:
        deviceClass: hdd
Substitute <nodeName> with the node name where the device <deviceByID>, <deviceName>, or <deviceByPath> is going to be added as a Ceph OSD.

Verify that the new Ceph OSD has appeared in the Ceph cluster and is in and up. The fullClusterInfo section should not contain any issues.

kubectl -n ceph-lcm-mirantis get mchealth -o yaml
For example:
status:
  fullClusterInfo:
    daemonStatus:
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok