Replace a failed Ceph node
After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:
Remove the obsolete Ceph node from the Ceph cluster.
Add a new Ceph node with the same configuration to the Ceph cluster.
Note

Ceph OSD node replacement requires using a KaaSCephOperationRequest CR. For the workflow overview and the description of its spec and phases, see High-level workflow of Ceph OSD or node removal.
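If you are unsure which resource names to use throughout this procedure, the following sketch lists the Machine and KaaSCephCluster resources in the managed cluster project. It assumes that your kubectl context points to the management cluster and uses <managedClusterProjectName> as a placeholder:

# List the machines and the KaaSCephCluster resource in the managed cluster project
kubectl get machines -n <managedClusterProjectName>
kubectl get kaascephcluster -n <managedClusterProjectName>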
Remove a failed Ceph node
Open the KaaSCephCluster CR of a managed cluster for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, remove the entry of the Ceph node to replace:

spec:
  cephClusterSpec:
    nodes:
      <machineName>: # remove the entire entry for the node to replace
        storageDevices: {...}
        role: [...]

Substitute <machineName> with the name of the machine to replace.

Save KaaSCephCluster and close the editor.
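To double-check which node entries remain in the nodes section without reopening the editor, you can list them non-interactively. This is a sketch that assumes the jq tool is available on your workstation:

# Print the machine names currently defined in spec.cephClusterSpec.nodes
kubectl -n <managedClusterProjectName> get kaascephcluster <kaasCephClusterName> -o json \
  | jq -r '.spec.cephClusterSpec.nodes | keys[]'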
Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-<machineName>-request
  namespace: <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        completeCleanUp: true
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <managedClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.
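For illustration, the saved file with the placeholders substituted might look similar to the following sketch. The machine name worker-1 matches the example used later in this procedure, while the project namespace and KaaSCephCluster name shown here are hypothetical:

# Write a filled-in request template to a file (all names below are examples)
cat > replace-failed-worker-1-request.yaml <<'EOF'
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-worker-1-request
  namespace: child-project            # hypothetical <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      worker-1:
        completeCleanUp: true
  kaasCephCluster:
    name: ceph-cluster-child          # hypothetical <kaasCephClusterName>
    namespace: child-project
EOF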
Apply the template to the cluster:

kubectl apply -f replace-failed-<machineName>-request.yaml
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
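To follow the request without rerunning the command, you can watch the resource or print only its phase. A sketch, using the status.phase field described later in this procedure:

# Watch the request resource for changes
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -w

# Print only the current phase
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request \
  -o jsonpath='{.status.phase}{"\n"}'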
Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml

Example of system response:

status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            ...
            <osdId>:
              deviceMapping:
                ...
                <deviceName>:
                  path: <deviceByPath>
                  partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                  type: "block"
                  class: "hdd"
                  zapDisk: true
If needed, change the following values:

<machineName> - the machine name where the replacement occurs, for example, worker-1.
<nodeName> - the underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.
<osdId> - the actual Ceph OSD ID for the device being replaced, for example, 1.
<deviceName> - the actual device name placed on the node, for example, sdb.
<deviceByPath> - the actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.
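To print only the removeInfo section instead of the whole CR, the following sketch uses the field path from the example above and assumes jq is available:

# Show the removeInfo section of the request status
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request \
  -o json | jq '.status.osdRemoveStatus.removeInfo'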
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml

Example of system response:

status:
  phase: ApproveWaiting
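If you prefer to wait for the phase in a script rather than re-checking manually, a minimal polling sketch (substitute the placeholders first) could look as follows:

# Poll the request until it reaches the ApproveWaiting phase
until [ "$(kubectl -n <managedClusterProjectName> get kaascephoperationrequest \
    replace-failed-<machineName>-request -o jsonpath='{.status.phase}')" = "ApproveWaiting" ]; do
  echo "Waiting for the ApproveWaiting phase..."
  sleep 10
done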
Edit the KaaSCephOperationRequest CR and set the approve flag to true:

kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-request

For example:

spec:
  osdRemove:
    approve: true
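As a non-interactive alternative to kubectl edit, a merge patch may also set the flag. This is a sketch; verify that it fits your workflow before using it:

# Approve the removal request with a merge patch
kubectl -n <managedClusterProjectName> patch kaascephoperationrequest replace-failed-<machineName>-request \
  --type=merge -p '{"spec":{"osdRemove":{"approve":true}}}'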
Review the following status fields of the KaaSCephOperationRequest CR to monitor the request processing:

status.phase - the current state of request processing
status.messages - the description of the current phase
status.conditions - the full history of request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - the error and warning messages that occurred during request processing, if any
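To print these fields without scrolling through the whole resource, you can use a sketch like the following, assuming jq is available:

# Print the phase, messages, and conditions of the request
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request \
  -o json | jq '.status | {phase, messages, conditions}'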
Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
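To check for any reported issues or warnings together with the final phase, the following sketch may help. Note that, depending on the product version, the removeInfo data may be nested under status.removeInfo or status.osdRemoveStatus.removeInfo, so adjust the path to match your CR:

# Show issues and warnings reported during the removal, if any
kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request \
  -o json | jq '.status.osdRemoveStatus.removeInfo | {issues, warnings}'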
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
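To confirm that the cleanup jobs are gone after the deletion, list them by the same label:

# Should return no resources after the cleanup jobs are removed
kubectl get jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks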
Deploy a new Ceph node after removal of a failed one
Note

You can spawn a Ceph OSD on a raw device, but the device must be clean and without any data or partitions. If you want to add a device that was in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.
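As a rough, hedged illustration only, the device cleanup described in the official Rook documentation typically boils down to commands similar to the sketch below, run directly on the node that hosts the device. The exact steps depend on your Rook version, and these commands destroy all data on the device, so double-check the device name before running them:

# WARNING: destroys all data on /dev/<deviceName>; run on the node itself
sudo wipefs --all /dev/<deviceName>
sudo sgdisk --zap-all /dev/<deviceName>
sudo dd if=/dev/zero of=/dev/<deviceName> bs=1M count=100 oflag=direct,dsync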
Open the KaaSCephCluster CR of a managed cluster for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, add a new device for the replaced Ceph node:

spec:
  cephClusterSpec:
    nodes:
      <machineName>: # add new configuration for replaced Ceph node
        storageDevices:
        - fullPath: <deviceByID> # Recommended since MCC 2.25.0 (17.0.0), non-wwn by-id symlink
          # name: <deviceByID> # Prior to MCC 2.25.0, non-wwn by-id symlink
          # fullPath: <deviceByPath> # if the device is supposed to be added with by-path
          config:
            deviceClass: hdd
          ...

Substitute <machineName> with the machine name of the replaced node and configure it as required.
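As a non-interactive alternative to kubectl edit, a merge patch can add the node entry. The sketch below uses the example machine name worker-1 and a purely hypothetical by-id symlink, so replace both with your actual values:

# Add the replaced node with a single storage device via a merge patch
kubectl -n <managedClusterProjectName> patch kaascephcluster <kaasCephClusterName> --type=merge -p '
{"spec":{"cephClusterSpec":{"nodes":{"worker-1":{"storageDevices":[
  {"fullPath":"/dev/disk/by-id/ata-EXAMPLE_DISK_SERIAL123","config":{"deviceClass":"hdd"}}
]}}}}'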
Warning

Since MCC 2.25.0 (17.0.0), Mirantis highly recommends using only non-wwn by-id symlinks to specify storage devices in the storageDevices list. For details, see Container Cloud documentation: Addressing storage devices.
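To find a suitable non-wwn by-id symlink for the new device, you can list the by-id directory on the node and filter out the wwn-* entries, for example for sdb:

# List non-wwn by-id symlinks that resolve to sdb
ls -l /dev/disk/by-id/ | grep -v wwn- | grep sdb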
Verify that all Ceph daemons from the replaced node have appeared on the Ceph cluster and are in and up. The fullClusterInfo section should not contain any issues.

kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml

Example of system response:

status:
  fullClusterInfo:
    clusterStatus:
      ceph:
        health: HEALTH_OK
        ...
    daemonStatus:
      mgr:
        running: a is active mgr
        status: Ok
      mon:
        running: '3/3 mons running: [a b c] in quorum'
        status: Ok
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok
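For an additional check from inside the Ceph cluster, you can query Ceph through the Rook toolbox, assuming the rook-ceph-tools deployment is present in the rook-ceph namespace of the managed cluster:

# Check the overall cluster state and the OSD tree through the toolbox
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree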
Verify the Ceph node on the managed cluster:
kubectl -n rook-ceph get pod -o wide | grep <machineName>
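Alternatively, to list only the Ceph OSD pods with their node placement, you can filter by the standard Rook label instead of grepping:

# Show all OSD pods and the nodes they are scheduled on
kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide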