Replace a failed Ceph node¶
After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:
1. Remove the obsolete Ceph node from the Ceph cluster.
2. Add a new Ceph node with the same configuration to the Ceph cluster.
Note
Ceph OSD node replacement requires a KaaSCephOperationRequest CR. For the workflow overview and the description of the spec and phases, see High-level workflow of Ceph OSD or node removal.
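Before you start, it may also help to capture the current Ceph cluster state for later comparison. A minimal sketch, assuming the same kubectl context as the rest of this procedure and a hypothetical output file name:

kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml > ceph-cluster-before-replacement.yaml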
Remove a failed Ceph node¶
Open the KaaSCephCluster CR of the managed cluster for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, remove the entry of the node to replace:

spec:
  cephClusterSpec:
    nodes:
      <machineName>: # remove the entire entry for the node to replace
        storageDevices: {...}
        role: [...]
Substitute <machineName> with the name of the machine to replace.

Save KaaSCephCluster and close the editor.

Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-request.yaml:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-<machineName>-request
  namespace: <managedClusterProjectName>
spec:
  osdRemove:
    nodes:
      <machineName>:
        completeCleanUp: true
  kaasCephCluster:
    name: <kaasCephClusterName>
    namespace: <managedClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource in the <managedClusterProjectName> namespace.

Apply the template to the cluster:
kubectl apply -f replace-failed-<machineName>-request.yaml
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
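To watch the request while the controller picks it up, the standard kubectl watch flag can be added, for example:

kubectl get kaascephoperationrequest -n <managedClusterProjectName> -w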
Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
Example of system response:
status:
  childNodesMapping:
    <nodeName>: <machineName>
  osdRemoveStatus:
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            ...
            <osdId>:
              deviceMapping:
                ...
                <deviceName>:
                  path: <deviceByPath>
                  partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                  type: "block"
                  class: "hdd"
                  zapDisk: true
If needed, change the following values:

- <machineName> - machine name where the replacement occurs, for example, worker-1
- <nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
- <osdId> - actual Ceph OSD ID for the device being replaced, for example, 1
- <deviceName> - actual device name placed on the node, for example, sdb
- <deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
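To cross-check these values on the node itself, you can list the device symlinks there. A minimal sketch, where sdb is a hypothetical device name:

ls -l /dev/disk/by-path/ | grep sdb
ls -l /dev/disk/by-id/ | grep sdb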
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
Example of system response:
status:
  phase: ApproveWaiting
Edit the KaaSCephOperationRequest CR and set the approve flag to true:

kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-request
For example:
spec:
  osdRemove:
    approve: true
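If you prefer a non-interactive approach, a merge patch along the following lines should have the same effect; this is a sketch, not part of the original procedure:

kubectl -n <managedClusterProjectName> patch kaascephoperationrequest replace-failed-<machineName>-request --type merge -p '{"spec":{"osdRemove":{"approve":true}}}'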
Review the status of the KaaSCephOperationRequest resource request processing. The valuable parameters are as follows:

- status.phase - the current state of request processing
- status.messages - the description of the current phase
- status.conditions - full history of request processing before the current phase
- status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing
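For a quick look at the current phase without opening the full resource, a jsonpath query can be used, for example:

kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o jsonpath='{.status.phase}{"\n"}'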
Verify that the KaaSCephOperationRequest has been completed. For example:

status:
  phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
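To confirm that the cleanup jobs are gone, you can list jobs with the same label; the command is expected to return no resources:

kubectl get jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks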
Deploy a new Ceph node after removal of a failed one¶
Note
You can spawn Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to official Rook documentation.
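As an illustration only, a typical device zap sequence follows the pattern below; /dev/sdb is a hypothetical device, the commands run on the node that hosts it and destroy all data on it, and you should follow the official Rook documentation for the exact steps that match your Rook version:

# Zap the partition table and wipe the beginning of the device
sgdisk --zap-all /dev/sdb
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct,dsync
# Remove any remaining filesystem or LVM signatures
wipefs --all /dev/sdb
partprobe /dev/sdb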
Open the KaaSCephCluster CR of the managed cluster for editing:

kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, add a new device:

spec:
  cephClusterSpec:
    nodes:
      <machineName>: # add new configuration for replaced Ceph node
        storageDevices:
        - fullPath: <deviceByID> # Recommended since Container Cloud 2.25.0, non-wwn by-id symlink
          # name: <deviceByID> # Prior to Container Cloud 2.25.0, non-wwn by-id symlink
          # fullPath: <deviceByPath> # if the device is supposed to be added with by-path
          config:
            deviceClass: hdd
        ...
Substitute <machineName> with the machine name of the replaced node and configure it as required.

Warning

Since Container Cloud 2.25.0, Mirantis highly recommends using only non-wwn by-id symlinks to specify storage devices in the storageDevices list. For details, see Addressing storage devices.
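To find a suitable non-wwn by-id symlink for a device on the node, you can inspect the udev symlinks. A minimal sketch, where sdb is a hypothetical device name:

ls -l /dev/disk/by-id/ | grep -v wwn- | grep sdb
udevadm info --query=symlink --name=/dev/sdb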
Verify that all Ceph daemons from the replaced node have appeared on the Ceph cluster and are in and up. The fullClusterInfo section should not contain any issues.

kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
Example of system response:
status:
  fullClusterInfo:
    clusterStatus:
      ceph:
        health: HEALTH_OK
        ...
    daemonStatus:
      mgr:
        running: a is active mgr
        status: Ok
      mon:
        running: '3/3 mons running: [a b c] in quorum'
        status: Ok
      osd:
        running: '3/3 running: 3 up, 3 in'
        status: Ok
Verify the Ceph node on the managed cluster:
kubectl -n rook-ceph get pod -o wide | grep <machineName>
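Optionally, if a rook-ceph-tools pod is deployed in the cluster (common in Rook-based setups, but not covered by this procedure), you can also check the OSD placement directly in Ceph:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree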