Replace a failed Ceph node¶
After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:
Remove the obsolete Ceph node from the Ceph cluster.
Add a new Ceph node with the same configuration to the Ceph cluster.
Note

Ceph OSD node replacement presupposes usage of a KaaSCephOperationRequest CR. For the workflow overview and the description of spec and phases, see High-level workflow of Ceph OSD or node removal.
Remove a failed Ceph node¶
Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, remove the entry of the node to replace:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>: # remove the entire entry for the node to replace
            storageDevices: {...}
            role: [...]

Substitute <machineName> with the name of the machine to replace.

Save KaaSCephCluster and close the editor.

Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-<machineName>-request
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            completeCleanUp: true
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
Substitute <kaasCephClusterName> with the name of the corresponding KaaSCephCluster resource in the <managedClusterProjectName> namespace.

Apply the template to the cluster:
kubectl apply -f replace-failed-<machineName>-request.yaml
Verify that the corresponding request has been created:
kubectl get kaascephoperationrequest -n <managedClusterProjectName>
Verify that the removeInfo section appears in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
Example of system response:
    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                ...
                <osdId>:
                  deviceMapping:
                    ...
                    <deviceName>:
                      path: <deviceByPath>
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
If needed, change the following values:
<machineName> - the machine name where the replacement occurs, for example, worker-1
<nodeName> - the underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af
<osdId> - the actual Ceph OSD ID for the device being replaced, for example, 1
<deviceName> - the actual device name placed on the node, for example, sdb
<deviceByPath> - the actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9
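The nested removeInfo layout above can be tedious to eyeball on nodes with many OSDs. The following Python sketch flattens cleanUpMap into a list of devices scheduled for cleanup; it assumes the request status has already been fetched into a dict (for example, from `-o json` output), and the sample node name and data are hypothetical, mirroring the structure shown above:

```python
def devices_to_clean(remove_info):
    """Flatten removeInfo.cleanUpMap into (node, osd_id, device) tuples."""
    devices = []
    for node, node_info in remove_info.get("cleanUpMap", {}).items():
        for osd_id, osd_info in node_info.get("osdMapping", {}).items():
            for device in osd_info.get("deviceMapping", {}):
                devices.append((node, osd_id, device))
    return devices

# Hypothetical sample mirroring the removeInfo example above
remove_info = {
    "cleanUpMap": {
        "kaas-node-example": {
            "osdMapping": {
                "1": {"deviceMapping": {"sdb": {"zapDisk": True}}},
            },
        },
    },
}

print(devices_to_clean(remove_info))  # [('kaas-node-example', '1', 'sdb')]
```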
Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
Example of system response:
    status:
      phase: ApproveWaiting
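If you script the replacement, this gate can be checked before patching the CR. A minimal sketch, assuming the KaaSCephOperationRequest object has already been read into a dict; the "Processing" phase in the sample data is an assumption for illustration:

```python
def ready_for_approval(request):
    """True only when the osdRemove request is waiting for operator approval."""
    return request.get("status", {}).get("phase") == "ApproveWaiting"

waiting = {"status": {"phase": "ApproveWaiting"}}
processing = {"status": {"phase": "Processing"}}  # hypothetical earlier phase

print(ready_for_approval(waiting), ready_for_approval(processing))  # True False
```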
Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-request
For example:
    spec:
      osdRemove:
        approve: true
Review the following status fields of the KaaSCephOperationRequest CR request processing:

status.phase - the current state of request processing
status.messages - the description of the current phase
status.conditions - the full history of request processing before the current phase
status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any
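When polling the request from a script, these fields can be pulled into one summary. A sketch assuming the status subtree is already a dict; field names follow the list above, while the "Failed" terminal phase is an assumption:

```python
TERMINAL_PHASES = ("Completed", "CompletedWithWarnings", "Failed")  # Failed is assumed

def summarize_request(status):
    """Condense the status fields listed above into one dict for logging."""
    info = status.get("removeInfo", {})
    return {
        "phase": status.get("phase"),
        "messages": status.get("messages", []),
        "issues": info.get("issues", []),
        "warnings": info.get("warnings", []),
        "done": status.get("phase") in TERMINAL_PHASES,
    }

summary = summarize_request(
    {"phase": "CompletedWithWarnings", "removeInfo": {"warnings": ["osd.1 was down"]}}
)
print(summary["done"], summary["warnings"])  # True ['osd.1 was down']
```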
Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
Remove the device cleanup jobs:
kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
Deploy a new Ceph node after removal of a failed one¶
Note
You can spawn Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to official Rook documentation.
Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>

Substitute <managedClusterProjectName> with the corresponding value.

In the nodes section, add a new device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>: # add new configuration for replaced Ceph node
            storageDevices:
            - fullPath: <deviceByID> # Recommended since MCC 2.25.0 (17.0.0), non-wwn by-id symlink
              # name: <deviceByID> # Prior to MCC 2.25.0, non-wwn by-id symlink
              # fullPath: <deviceByPath> # if the device is supposed to be added with by-path
              config:
                deviceClass: hdd
                ...
Substitute <machineName> with the machine name of the replaced node and configure it as required.

Warning

Since MCC 2.25.0 (17.0.0), Mirantis highly recommends using only non-wwn by-id symlinks to specify storage devices in the storageDevices list. For details, see Container Cloud documentation: Addressing storage devices.
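To pick such a symlink, list /dev/disk/by-id for the disk and skip the wwn-* names. The selection logic can be sketched as follows; the sample symlink names are hypothetical:

```python
def preferred_by_id(symlinks):
    """Return the first non-wwn /dev/disk/by-id symlink, or None."""
    for link in symlinks:
        if not link.rsplit("/", 1)[-1].startswith("wwn-"):
            return link
    return None

links = [  # hypothetical listing for one disk
    "/dev/disk/by-id/wwn-0x5002538d40000000",
    "/dev/disk/by-id/scsi-SATA_SSD_170955300000",
]
print(preferred_by_id(links))  # /dev/disk/by-id/scsi-SATA_SSD_170955300000
```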
Verify that all Ceph daemons from the replaced node have appeared in the Ceph cluster and are in and up. The fullClusterInfo section should not contain any issues:

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
Example of system response:
    status:
      fullClusterInfo:
        clusterStatus:
          ceph:
            health: HEALTH_OK
        ...
        daemonStatus:
          mgr:
            running: a is active mgr
            status: Ok
          mon:
            running: '3/3 mons running: [a b c] in quorum'
            status: Ok
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
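The same check can be automated when waiting for the replaced node to settle. A sketch assuming fullClusterInfo has been fetched into a dict shaped like the example above:

```python
def daemons_healthy(full_cluster_info):
    """True when Ceph reports HEALTH_OK and every daemon group status is Ok."""
    ceph_ok = (
        full_cluster_info.get("clusterStatus", {})
        .get("ceph", {})
        .get("health")
        == "HEALTH_OK"
    )
    daemons = full_cluster_info.get("daemonStatus", {})
    return ceph_ok and bool(daemons) and all(
        group.get("status") == "Ok" for group in daemons.values()
    )

info = {  # trimmed copy of the example above
    "clusterStatus": {"ceph": {"health": "HEALTH_OK"}},
    "daemonStatus": {
        "mgr": {"status": "Ok"},
        "mon": {"status": "Ok"},
        "osd": {"status": "Ok"},
    },
}
print(daemons_healthy(info))  # True
```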
Verify the Ceph node on the managed cluster:
kubectl -n rook-ceph get pod -o wide | grep <machineName>