Replace a failed Ceph OSD
Warning
This procedure is valid for MOSK clusters that use the deprecated KaaSCephCluster custom resource (CR) instead of the MiraCeph CR that is available since MOSK 25.2 as the new Ceph configuration entrypoint. For the equivalent procedure that uses the MiraCeph CR, refer to the corresponding section of the MOSK documentation.
After a physical disk replacement, you can use the Ceph LCM API to redeploy a failed Ceph OSD. The common flow of replacing a failed Ceph OSD is as follows:

1. Remove the obsolete Ceph OSD from the Ceph cluster by device name, by Ceph OSD ID, or by path.
2. Add a new Ceph OSD on the new disk to the Ceph cluster.
Note
Ceph OSD replacement presupposes usage of a KaaSCephOperationRequest CR. For the workflow overview and the description of its spec and phases, see High-level workflow of Ceph OSD or node removal.
Remove a failed Ceph OSD by device name, path, or ID
Warning
The procedure below presupposes that the operator knows the exact device name, by-path, or by-id of the replaced device, as well as the node on which the replacement occurred.
Warning
A Ceph OSD removal using by-path, by-id, or device name is
not supported if a device was physically removed from a node. Therefore, use
cleanupByOsdId instead. For details, see
Remove a failed Ceph OSD by Ceph OSD ID.
Warning
Mirantis does not recommend setting a device name or a device by-path symlink in the cleanupByDevice field because these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field, or use cleanupByOsdId instead. For details, see Addressing storage devices using MiraCeph.
1. Open the KaaSCephCluster CR of a MOSK cluster for editing:

kubectl edit kaascephcluster -n <moskClusterProjectName>

Substitute <moskClusterProjectName> with the corresponding value.

2. In the nodes section, remove the required device from storageDevices or update the storageDeviceFilter regexp accordingly. For example:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName>  # remove the entire item from the storageDevices list
          # fullPath: <deviceByPath> if the device is specified with a symlink instead of a name
          config:
            deviceClass: hdd

Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be replaced. If the node uses storageDeviceFilter, see the sketch below.
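If the node is configured with a storageDeviceFilter regexp instead of an explicit storageDevices list, tighten the regexp so that it no longer matches the replaced device. The following is a minimal sketch that assumes the hypothetical devices /dev/sdb, /dev/sdc, and /dev/sdd were matched and /dev/sdc is being removed; adjust it to your actual device names and spec layout:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        # Before: matched /dev/sdb, /dev/sdc, and /dev/sdd
        # storageDeviceFilter: "^/dev/sd[b-d]$"
        # After: no longer matches the replaced /dev/sdc
        storageDeviceFilter: "^/dev/sd[bd]$"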
3. Save KaaSCephCluster and close the editor.

4. Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<deviceName>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-<machineName>-<deviceName>
  namespace: <moskClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        cleanupByDevice:
        - name: <deviceName>
          # If the device is specified with by-path or by-id instead of
          # a name, use path: <deviceByPath> or path: <deviceById>.
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <moskClusterProjectName>

Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <moskClusterProjectName> namespace.

5. Apply the template to the cluster:
kubectl apply -f replace-failed-osd-<machineName>-<deviceName>-request.yaml
6. Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <moskClusterProjectName>
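The new request appears in the listing. The exact columns depend on the installed CRD, so treat the following output as illustrative only:

NAME                                            AGE
replace-failed-osd-<machineName>-<deviceName>   15s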
7. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml

Example of system response:

status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
Definition of values in angle brackets:

- <machineName> - name of the machine on which the device is being replaced, for example, worker-1
- <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
- <osdId> - Ceph OSD ID for the device being replaced, for example, 1
- <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
- <dataDevice> - name of the device placed on the node, for example, /dev/sdb
8. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml

Example of system response:

status:
  phase: ApproveWaiting
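Instead of inspecting the full YAML each time, you can poll just the phase with a standard kubectl jsonpath query, for example:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o jsonpath='{.status.phase}{"\n"}'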
9. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

kubectl -n <moskClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName>

For example:

spec:
  osdRemove:
    approve: true
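As an alternative to an interactive edit, the flag can be set non-interactively with a standard kubectl merge patch:

kubectl -n <moskClusterProjectName> patch kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> --type merge -p '{"spec":{"osdRemove":{"approve":true}}}'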
10. Review the following status fields of the Ceph LCM CR request processing:

- status.phase - current state of request processing
- status.messages - description of the current phase
- status.conditions - full history of request processing before the current phase
- status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any
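For a quick check of only the error and warning messages, a jsonpath query over the fields listed above can be used, for example:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o jsonpath='{.status.removeInfo.issues}{"\n"}{.status.removeInfo.warnings}{"\n"}'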
11. Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
12. Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
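Optionally, confirm the overall cluster health after the removal from the Ceph CLI in the rook-ceph-tools Pod, using the same access pattern as in the procedure below:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s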
Remove a failed Ceph OSD by Ceph OSD ID
Caution
The procedure below presupposes that the operator knows only the failed Ceph OSD ID.
1. Identify the node and device names used by the affected Ceph OSD. Using the Ceph CLI in the rook-ceph-tools Pod, run:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd metadata <osdId>

Substitute <osdId> with the affected OSD ID.

Example output:

{
    "id": 1,
    ...
    "bluefs_db_devices": "vdc",
    ...
    "bluestore_bdev_devices": "vde",
    ...
    "devices": "vdc,vde",
    ...
    "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
    ...
}
In the example above, hostname is the node name and devices are all devices used by the affected Ceph OSD.

2. Obtain <machineName> for <nodeName> where the Ceph OSD is placed:

kubectl -n rook-ceph get node -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.kaas\.mirantis\.com\/machine-name}{"\n"}{end}'
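Each output line pairs a node name with its machine name; for example (values are illustrative):

kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf worker-3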
3. Open the KaaSCephCluster CR of a MOSK cluster for editing:

kubectl edit kaascephcluster -n <moskClusterProjectName>

Substitute <moskClusterProjectName> with the corresponding value.

4. In the nodes section, remove the required device from storageDevices or update the storageDeviceFilter regexp accordingly, as in the previous procedure. For example:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - name: <deviceName>  # remove the entire item from the storageDevices list
          config:
            deviceClass: hdd

Substitute <machineName> with the machine name of the node where the device <deviceName> is going to be replaced.
5. Save KaaSCephCluster and close the editor.

6. Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-osd-<osdId>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-<machineName>-osd-<osdId>
  namespace: <moskClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        cleanupByOsdId:
        - <osdId>
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <moskClusterProjectName>

Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <moskClusterProjectName> namespace.

7. Apply the template to the cluster:
kubectl apply -f replace-failed-<machineName>-osd-<osdId>-request.yaml
8. Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <moskClusterProjectName>
9. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml

Example of system response:

status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            <osdId>:
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
Definition of values in angle brackets:

- <machineName> - name of the machine on which the device is being replaced, for example, worker-1
- <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
- <osdId> - Ceph OSD ID for the device being replaced, for example, 1
- <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
- <dataDevice> - name of the device placed on the node, for example, /dev/sdb
10. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml

Example of system response:

status:
  phase: ApproveWaiting
11. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

kubectl -n <moskClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-osd-<osdId>

For example:

spec:
  osdRemove:
    approve: true
12. Review the following status fields of the Ceph LCM CR request processing:

- status.phase - current state of request processing
- status.messages - description of the current phase
- status.conditions - full history of request processing before the current phase
- status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any
13. Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
14. Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Deploy a new device after removal of a failed one
Note
You can spawn a Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was previously in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation. A cleanup sketch follows this note.
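The following is a minimal cleanup sketch along the lines of the Rook teardown guidance; verify it against the Rook documentation for your Rook version. /dev/sdX is a hypothetical device name, and all commands are destructive:

# Remove filesystem and partition-table signatures from the device
wipefs --all /dev/sdX
sgdisk --zap-all /dev/sdX
# Overwrite the beginning of the disk to clear leftover Ceph metadata
dd if=/dev/zero of=/dev/sdX bs=1M count=100 oflag=direct,dsync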
If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

- You can add a raw device to a node during node deployment.
- If a node supports adding devices without node reboot, you can hot plug a raw device to the node.
- If a node does not support adding devices without node reboot, you can hot plug a raw device during node shutdown. In this case, complete the following steps:
  1. Enable maintenance mode on the MOSK cluster.
  2. Turn off the required node.
  3. Attach the required raw device to the node.
  4. Turn on the required node.
  5. Disable maintenance mode on the MOSK cluster.
1. Open the KaaSCephCluster CR of a MOSK cluster for editing:

kubectl edit kaascephcluster -n <moskClusterProjectName>

Substitute <moskClusterProjectName> with the corresponding value.

2. In the nodes section, add the new required device to storageDevices or update the storageDeviceFilter regexp accordingly. For example:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:
        storageDevices:
        - fullPath: <deviceByID>  # if the device is to be added with by-id
          # fullPath: <deviceByPath>  # if the device is to be added with by-path
          config:
            deviceClass: hdd

Substitute <machineName> with the machine name of the node where the device <deviceByID> or <deviceByPath> is going to be added as a Ceph OSD. A filled-in sketch follows this step.
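For illustration only, a filled-in entry might look as follows; the machine name and by-id serial are hypothetical, so use the actual /dev/disk/by-id symlink from the node:

spec:
  cephClusterSpec:
    nodes:
      worker-1:
        storageDevices:
        # by-id symlinks are stable across reboots, unlike device names
        - fullPath: /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_example123
          config:
            deviceClass: hdd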
3. Verify that the new Ceph OSD has appeared in the Ceph cluster and is in and up. The fullClusterInfo section should not contain any issues:

kubectl -n <moskClusterProjectName> get kaascephcluster -o yaml
For example:
status:
  fullClusterInfo:
    daemonStatus:
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok
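You can additionally confirm that the new OSD is up from the Ceph CLI in the rook-ceph-tools Pod, using the same access pattern as earlier in this document:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree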