Replace a failed Ceph node¶
Warning

This procedure is valid for MOSK clusters that use the deprecated KaaSCephCluster custom resource (CR) instead of the MiraCeph CR, which is available since MOSK 25.2 as the new Ceph configuration entry point. For the equivalent procedure that uses the MiraCeph CR, refer to the following section.
After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:
1. Remove the obsolete Ceph node from the Ceph cluster.
2. Add a new Ceph node with the same configuration to the Ceph cluster.
Note
Ceph OSD node replacement presupposes the use of a KaaSCephOperationRequest CR. For the workflow overview and the description of its spec and phases, see High-level workflow of Ceph OSD or node removal.
Remove a failed Ceph node¶
Open the KaaSCephCluster CR of a MOSK cluster for editing:

kubectl edit kaascephcluster -n <moskClusterProjectName>

Substitute <moskClusterProjectName> with the corresponding value.

In the nodes section, remove the required device or update the storageDeviceFilter regexp accordingly. For example:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:   # remove the entire entry for the node to replace
        storageDevices: {...}
        role: [...]
Substitute <machineName> with the machine name to replace.

Save KaaSCephCluster and close the editor.

Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-<machineName>-request
  namespace: <moskClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        completeCleanUp: true
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <moskClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <moskClusterProjectName> namespace.

Apply the template to the cluster:

kubectl apply -f replace-failed-<machineName>-request.yaml
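For illustration only, a fully substituted request might look as follows. The machine name worker-1 matches the example used later in this procedure, while the mosk-cluster-ns namespace and the ceph-cluster-mosk resource name are hypothetical values:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-worker-1-request
  namespace: mosk-cluster-ns        # hypothetical project namespace
spec:
  osdRemove:
    nodes:
      worker-1:                     # machine to clean up and remove
        completeCleanUp: true
  kaasCephCluster:
    name: ceph-cluster-mosk         # hypothetical KaaSCephCluster name
    namespace: mosk-cluster-ns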
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <moskClusterProjectName>
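With the default kubectl printer, the listing shows at least the NAME and AGE columns; additional columns may appear depending on the product release. Illustrative output:

NAME                                   AGE
replace-failed-<machineName>-request   1m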
Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml

Example of system response:

status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            ...
            <osdId>:
              deviceMapping:
                ...
                <deviceName>:
                  path: <deviceByPath>
                  partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                  type: "block"
                  class: "hdd"
                  zapDisk: true
If needed, change the following values:

<machineName> - machine name where the replacement occurs, for example, worker-1.
<nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.
<osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.
<deviceName> - actual device name placed on the node, for example, sdb.
<deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <moskClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml

Example of system response:

status:
  phase: ApproveWaiting
Edit the KaaSCephOperationRequest CR and set the approve flag to true:

kubectl -n <moskClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-request

For example:

spec:
  osdRemove:
    approve: true
Review the following status fields of the Ceph LCM CR request processing:

status.phase - current state of the request processing
status.messages - description of the current phase
status.conditions - full history of the request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during the request processing, if any
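To watch only the current phase instead of reviewing the full YAML each time, you can use a standard kubectl JSONPath query (a generic kubectl technique, not a MOSK-specific command):

kubectl -n <moskClusterProjectName> get kaascephoperationrequest \
  replace-failed-<machineName>-request -o jsonpath='{.status.phase}{"\n"}'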
Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Deploy a new Ceph node after removal of a failed one¶
Note
You can spawn a Ceph OSD on a raw device, but it must be clean and contain no data or partitions. If you want to add a device that was previously in use, make sure it is also raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.
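If the replacement disk was previously used, a common cleanup sequence looks like the following. This is a minimal sketch based on the general Rook cleanup guidance, not a MOSK-specific procedure; the device name /dev/sdb is a hypothetical example, and the commands are destructive, so double-check the target device first:

# Run on the node that owns the disk; replace /dev/sdb with the actual device.
DISK="/dev/sdb"

# Remove filesystem and LVM signatures left by a previous Ceph OSD.
sudo wipefs --all "${DISK}"

# Zap the partition table.
sudo sgdisk --zap-all "${DISK}"

# Overwrite the beginning of the disk to clear residual Ceph metadata.
sudo dd if=/dev/zero of="${DISK}" bs=1M count=100 oflag=direct,dsync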
Open the KaaSCephCluster CR of a MOSK cluster for editing:

kubectl edit kaascephcluster -n <moskClusterProjectName>

Substitute <moskClusterProjectName> with the corresponding value.

In the nodes section, add a new device:

spec:
  cephClusterSpec:
    nodes:
      <machineName>:   # add new configuration for the replaced Ceph node
        storageDevices:
        - fullPath: <deviceByID> # Recommended since MCC 2.25.0 (17.0.0), non-wwn by-id symlink
          # name: <deviceByID> # Prior to MCC 2.25.0, non-wwn by-id symlink
          # fullPath: <deviceByPath> # if the device is supposed to be added with by-path
          config:
            deviceClass: hdd
        ...
Substitute <machineName> with the machine name of the replaced node and configure it as required.

Warning

Since MOSK 23.3, Mirantis highly recommends using only non-wwn by-id symlinks to specify storage devices in the storageDevices list. For details, see Addressing storage devices prior to MOSK 25.2.
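To find a non-wwn by-id symlink for a disk, you can inspect the node directly. A minimal sketch using only standard Linux tooling, assuming the disk is currently visible as /dev/sdb (a hypothetical device name):

# On the node that hosts the disk, list the by-id symlinks that resolve to sdb
ls -l /dev/disk/by-id/ | grep -w 'sdb'

# Alternatively, print all symlinks known to udev for the device
udevadm info --query=symlink --name=/dev/sdb

# Pick a non-wwn entry (for example, one prefixed with ata- or scsi-) and use its
# full /dev/disk/by-id/... path as the fullPath value in storageDevices.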
Verify that all Ceph daemons from the replaced node have appeared on the Ceph cluster and are in and up. The fullClusterInfo section should not contain any issues:

kubectl -n <moskClusterProjectName> get kaascephcluster -o yaml
Example of system response:

status:
  fullClusterInfo:
    clusterStatus:
      ceph:
        health: HEALTH_OK
    ...
    daemonStatus:
      mgr:
        running: a is active mgr
        status: Ok
      mon:
        running: '3/3 mons running: [a b c] in quorum'
        status: Ok
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok
Verify the Ceph node on the MOSK cluster:
kubectl -n rook-ceph get pod -o wide | grep <machineName>
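The output should list the Rook Ceph pods scheduled on the replaced node, typically the rook-ceph-osd pods in the Running state and a completed rook-ceph-osd-prepare pod. The pod names, readiness counts, and addresses below are illustrative only:

rook-ceph-osd-1-7d8c4f9b6d-x2kqp          1/1   Running     0   5m   10.233.1.12   <machineName>   <none>   <none>
rook-ceph-osd-prepare-<nodeName>-xkz7b    0/1   Completed   0   7m   10.233.1.11   <machineName>   <none>   <none>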