Mirantis Container Cloud (MCC) becomes part of Mirantis OpenStack for Kubernetes (MOSK)!

Starting with MOSK 25.2, the MOSK documentation set covers all product layers, including MOSK management (formerly MCC). This means everything you need is in one place. The separate MCC documentation site will be retired, so please update your bookmarks for continued easy access to the latest content.

Replace a failed Ceph node

Warning

This procedure is valid for MOSK clusters that use the MiraCeph custom resource (CR), which is available since MOSK 25.2 to replace the deprecated KaaSCephCluster. For the equivalent procedure with the KaaSCephCluster CR, refer to the following section:

Replace a failed Ceph node

After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:

  1. Remove the obsolete Ceph node from the Ceph cluster.

  2. Add a new Ceph node with the same configuration to the Ceph cluster.

Note

Ceph OSD node replacement requires the use of a CephOsdRemoveRequest CR. For the workflow overview and the description of its spec and phases, see High-level workflow of Ceph OSD or node removal.

Remove a failed Ceph node

  1. Open the MiraCeph CR on a MOSK cluster for editing:

    kubectl -n ceph-lcm-mirantis edit miraceph
    
  2. In the nodes section, remove the entry of the node to replace. When using device filters, update the regexp accordingly.

    For example:

    spec:
      nodes:
      - name: <nodeName> # remove the entire entry for the node to replace
        devices: {...}
        role: [...]
    

    Substitute <nodeName> with the name of the node to replace.

  3. Save MiraCeph and close the editor.

  4. Create a CephOsdRemoveRequest CR template and save it as replace-failed-<nodeName>-request.yaml:

    apiVersion: lcm.mirantis.com/v1alpha1
    kind: CephOsdRemoveRequest
    metadata:
      name: replace-failed-<nodeName>-request
      namespace: ceph-lcm-mirantis
    spec:
      nodes:
        <nodeName>:
          completeCleanUp: true
    
  5. Apply the template to the cluster:

    kubectl apply -f replace-failed-<nodeName>-request.yaml
    
  6. Verify that the corresponding request has been created:

    kubectl -n ceph-lcm-mirantis get cephosdremoverequest
    
  7. Verify that the removeInfo section appeared in the CephOsdRemoveRequest CR status:

    kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-<nodeName>-request -o yaml
    

    Example of system response:

    status:
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                ...
                <osdId>:
                  deviceMapping:
                    ...
                    <deviceName>:
                      path: <deviceByPath>
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    

    The system response contains the following values:

    • <nodeName> - the underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.

    • <osdId> - the actual Ceph OSD ID of the device being replaced, for example, 1.

    • <deviceName> - the actual device name on the node, for example, sdb.

    • <deviceByPath> - the actual device by-path on the node, for example, /dev/disk/by-path/pci-0000:00:1f.9.

  8. Verify that the cleanUpMap section matches the intended removal, then wait for the ApproveWaiting phase to appear in the status:

    kubectl -n ceph-lcm-mirantis get cephosdremoverequest replace-failed-<nodeName>-request -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
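    Instead of fetching the full YAML each time, you can poll the phase alone. This is a sketch that reuses the request name from the steps above; kubectl jsonpath output support is standard:

    ```shell
    # Poll only the request phase until it reaches ApproveWaiting
    kubectl -n ceph-lcm-mirantis get cephosdremoverequest \
      replace-failed-<nodeName>-request \
      -o jsonpath='{.status.phase}'
    ```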
  9. Edit the CephOsdRemoveRequest CR and set the approve flag to true:

    kubectl -n ceph-lcm-mirantis edit cephosdremoverequest replace-failed-<nodeName>-request
    

    For example:

    spec:
      approve: true
    
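    As an alternative to the interactive edit, the approve flag can be set non-interactively with a merge patch. This is a sketch, not the documented procedure; it relies only on the standard kubectl patch command:

    ```shell
    # Set spec.approve to true without opening an editor
    kubectl -n ceph-lcm-mirantis patch cephosdremoverequest \
      replace-failed-<nodeName>-request \
      --type merge -p '{"spec":{"approve":true}}'
    ```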
  10. Monitor the request processing through the following status fields of the Ceph LCM CR:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  11. Verify that the CephOsdRemoveRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  12. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
    
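    To confirm that the cleanup jobs are gone, list them with the same label selector used for the deletion; the command should return no resources:

    ```shell
    # Verify that no device cleanup jobs remain
    kubectl -n ceph-lcm-mirantis get jobs -l app=miraceph-cleanup-disks
    ```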

Deploy a new Ceph node after removal of a failed one

Note

You can deploy a Ceph OSD on a raw device, but the device must be clean and contain no data or partitions. If you want to add a device that was previously in use, first ensure that it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.
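As a rough sketch of such a cleanup, based on the approach described in the Rook documentation (the device path is a placeholder; the exact steps may differ, so refer to the Rook documentation for your release):

```shell
# WARNING: destroys all data on the target device. <device> is a placeholder.
DISK="/dev/<device>"
# Remove all partition-table structures from the device
sgdisk --zap-all "$DISK"
# Overwrite the beginning of the disk to clear leftover metadata,
# for example, LVM or BlueStore labels
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
# Inform the kernel of the partition-table changes
partprobe "$DISK"
```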

  1. Open the MiraCeph CR on a MOSK cluster for editing:

    kubectl -n ceph-lcm-mirantis edit miraceph
    
  2. In the nodes section, add a new device:

    spec:
      nodes:
      - name: <nodeName> # add new configuration for replaced Ceph node
        devices:
        - fullPath: <deviceByID> # Recommended. Non-wwn by-id symlink.
          # name: <deviceByID> # Not recommended. Use to add a device by its by-id.
          # fullPath: <deviceByPath> # Use to add a device by its by-path.
          config:
            deviceClass: hdd
          ...
    

    Substitute <nodeName> with the replaced node name and configure it as required.

    Warning

    Mirantis highly recommends using only non-wwn by-id symlinks to specify storage devices in the devices list.

    For details, see Addressing storage devices since MOSK 25.2.

  3. Verify that all Ceph daemons from the replaced node have appeared in the Ceph cluster and are in and up, and that the fullClusterInfo section does not contain any issues:

    kubectl -n ceph-lcm-mirantis get mchealth -o yaml
    

    Example of system response:

    status:
      fullClusterInfo:
        clusterStatus:
          ceph:
            health: HEALTH_OK
            ...
        daemonStatus:
          mgr:
            running: a is active mgr
            status: Ok
          mon:
            running: '3/3 mons running: [a b c] in quorum'
            status: Ok
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
    
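    If the rook-ceph-tools deployment is available in the cluster (an assumption; it is not part of this procedure), you can also query the Ceph state directly through the toolbox pod:

    ```shell
    # Query Ceph health and the OSD tree through the toolbox pod
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
    ```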
  4. Verify the Ceph node on the MOSK cluster:

    kubectl -n rook-ceph get pod -o wide | grep <nodeName>
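    To narrow the output down to OSD pods only, you can filter by the common Rook label convention (a sketch; the label value is an assumption taken from upstream Rook, not from this procedure):

    ```shell
    # List only Ceph OSD pods scheduled on the replaced node
    kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide | grep <nodeName>
    ```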