Replace a failed Ceph OSD

After replacing a physical disk, you can use the Ceph LCM API to redeploy the failed Ceph OSD. The common flow of replacing a failed Ceph OSD is as follows:

  1. Remove the obsolete Ceph OSD from the Ceph cluster by device name, by Ceph OSD ID, or by path.

  2. Add a new Ceph OSD on the new disk to the Ceph cluster.

Note

Ceph OSD replacement requires a KaaSCephOperationRequest CR. For the workflow overview and a description of the spec and phases, see High-level workflow of Ceph OSD or node removal.

Remove a failed Ceph OSD by device name, path, or ID

Warning

The procedure below presupposes that the Operator knows the exact device name, by-path, or by-id of the replaced device, as well as the node on which the replacement occurred.

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, remove the required device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from storageDevices list
              # fullPath: <deviceByPath> if device is specified with by-path instead of name
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be replaced.

  3. Save the KaaSCephCluster CR and close the editor.

  4. Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<deviceName>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-osd-<machineName>-<deviceName>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByDevice:
            - name: <deviceName>
              # If the device is specified by by-path or by-id (since Container
              # Cloud 2.19.0 for non-MOSK clusters) instead of by name,
              # use path: <deviceByPath> or path: <deviceById>.
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <kaasCephClusterName> with the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.

  5. Apply the template to the cluster:

    kubectl apply -f replace-failed-osd-<machineName>-<deviceName>-request.yaml
    
  6. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  7. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                <osdId>:
                  deviceMapping:
                    <deviceName>:
                      path: <deviceByPath>
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    

    The placeholders in the example above are defined as follows:

    • <machineName> - machine name where the replacement occurs, for example, worker-1.

    • <nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.

    • <osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.

    • <deviceName> - actual device name placed on the node, for example, sdb.

    • <deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.

  8. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  9. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName>
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  10. Review the processing status of the KaaSCephOperationRequest resource. The key parameters are as follows:

    • status.phase - the current state of request processing

    • status.messages - the description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing

  11. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  12. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
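
The manifest from step 4 and the phase checks from steps 8 and 11 can be scripted. The following is a minimal sketch, not part of the product tooling: the function names render_remove_by_device and wait_for_phase, the retry count, and the polling interval are assumptions made for illustration. The manifest fields and the status.phase path are taken from the procedure above.

```shell
#!/bin/sh
# Hypothetical helpers for the removal-by-device procedure above.

# Print a KaaSCephOperationRequest manifest that removes one device by name.
# Args: <managedClusterProjectName> <machineName> <deviceName> <kaasCephClusterName>
render_remove_by_device() {
  cat <<EOF
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-$2-$3
  namespace: $1
spec:
  osdRemove:
    nodes:
      $2:
        cleanupByDevice:
        - name: $3
  kaasCephCluster:
    name: $4
    namespace: $1
EOF
}

# Poll status.phase until it equals the wanted value or the attempts run out.
# Args: <managedClusterProjectName> <requestName> <wantedPhase> [attempts] [sleepSeconds]
# The defaults (60 attempts, 10 seconds apart) are arbitrary assumptions.
wait_for_phase() {
  attempts="${4:-60}"
  delay="${5:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(kubectl -n "$1" get kaascephoperationrequest "$2" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    if [ "$phase" = "$3" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (values are illustrative):
#   render_remove_by_device my-project worker-1 sdb ceph-cluster \
#     > replace-failed-osd-worker-1-sdb-request.yaml
#   kubectl apply -f replace-failed-osd-worker-1-sdb-request.yaml
#   wait_for_phase my-project replace-failed-osd-worker-1-sdb ApproveWaiting
```

The sketch only waits for a phase; it does not set the approve flag. Review the cleanUpMap section manually before approving, as described in step 8, because the approval triggers the actual removal.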
    

Remove a failed Ceph OSD by Ceph OSD ID

Warning

The procedure below presupposes that the Operator knows only the failed Ceph OSD ID.

  1. Obtain the osd-device mapping from the status section of the KaaSCephCluster CR:

    kubectl get kaascephcluster -n <managedClusterProjectName> -o yaml
    

    Substitute <managedClusterProjectName> with the corresponding value.

    For example:

    status:
      fullClusterInfo:
        cephDetails:
          cephDeviceMapping:
            <nodeName>:
              <osdId>: <deviceName>
    
    • <nodeName> - the corresponding node name that hosts the Ceph OSD

    • <osdId> - the ID of the Ceph OSD to replace

    • <deviceName> - an actual device name to replace

  2. Obtain <machineName> for <nodeName> where the Ceph OSD is placed:

    kubectl -n rook-ceph get node -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.kaas\.mirantis\.com\/machine-name}{"\n"}{end}'
    
  3. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  4. In the nodes section, remove the required device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from storageDevices list
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where the device <deviceName> is going to be replaced.

  5. Save the KaaSCephCluster CR and close the editor.

  6. Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-osd-<osdId>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-<machineName>-osd-<osdId>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByOsdId:
            - <osdId>
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <kaasCephClusterName> with the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.

  7. Apply the template to the cluster:

    kubectl apply -f replace-failed-<machineName>-osd-<osdId>-request.yaml
    
  8. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  9. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                <osdId>:
                  deviceMapping:
                    <deviceName>:
                      path: <deviceByPath>
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    

    The placeholders in the example above are defined as follows:

    • <machineName> - machine name where the replacement occurs, for example, worker-1.

    • <nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.

    • <osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.

    • <deviceName> - actual device name placed on the node, for example, sdb.

    • <deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.

  10. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  11. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-osd-<osdId>
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  12. Review the processing status of the KaaSCephOperationRequest resource. The key parameters are as follows:

    • status.phase - the current state of request processing

    • status.messages - the description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing

  13. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  14. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
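
Steps 2 and 6 of this procedure lend themselves to scripting: looking up the machine name for a node and generating the request for a known Ceph OSD ID. The following is a minimal sketch under stated assumptions: the function names machine_for_node and render_remove_by_osd_id are hypothetical, and machine_for_node expects on standard input the "<nodeName> <machineName>" lines that the kubectl command in step 2 prints.

```shell
#!/bin/sh
# Hypothetical helpers for the removal-by-OSD-ID procedure above.

# Print the machine name for a node; exit non-zero if the node is not listed.
# Args: <nodeName>; stdin: lines of "<nodeName> <machineName>"
machine_for_node() {
  awk -v node="$1" '
    $1 == node { print $2; found = 1 }
    END { exit found ? 0 : 1 }'
}

# Print a KaaSCephOperationRequest manifest that removes one Ceph OSD by ID.
# Args: <managedClusterProjectName> <machineName> <osdId> <kaasCephClusterName>
render_remove_by_osd_id() {
  cat <<EOF
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-$2-osd-$3
  namespace: $1
spec:
  osdRemove:
    nodes:
      $2:
        cleanupByOsdId:
        - $3
  kaasCephCluster:
    name: $4
    namespace: $1
EOF
}

# Example usage (values are illustrative):
#   machine="$(kubectl get node -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.kaas\.mirantis\.com\/machine-name}{"\n"}{end}' \
#     | machine_for_node kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af)"
#   render_remove_by_osd_id my-project "$machine" 1 ceph-cluster \
#     > "replace-failed-$machine-osd-1-request.yaml"
```

Failing when the node is missing from the mapping keeps the script from rendering a request with an empty machine name.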
    

Deploy a new device after removal of a failed one

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, add a new device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>
              # fullPath: <deviceByPath> # if device is supposed to be added with by-path
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be added as a Ceph OSD.

  3. Verify that the new Ceph OSD has appeared in the Ceph cluster and is both in and up. The fullClusterInfo section must not contain any issues:

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    

    For example:

    status:
      fullClusterInfo:
        daemonStatus:
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
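
The check in step 3 can be automated by parsing the running string from the status. The following is a minimal sketch: the function name osds_all_up_in is hypothetical, and the parser assumes the string always has the shape shown in the example above, "<running>/<total> running: <up> up, <in> in". Any other shape is treated as a failure.

```shell
#!/bin/sh
# Hypothetical helper for the verification step above.

# Succeed only if every Ceph OSD is reported as running, up, and in.
# Args: the value of status.fullClusterInfo.daemonStatus.osd.running
osds_all_up_in() {
  printf '%s\n' "$1" | awk '
    {
      # Fields: "3/3" "running:" "3" "up," "3" "in"
      split($1, counts, "/")
      running = counts[1]
      total = counts[2]
      up_count = $3
      in_count = $5
      ok = (NF == 6 && running == total && up_count == total && in_count == total)
    }
    END { exit ok ? 0 : 1 }'
}

# Example usage (assumes a single KaaSCephCluster object in the namespace,
# so the first list item is the one to inspect):
#   running="$(kubectl -n my-project get kaascephcluster \
#     -o jsonpath='{.items[0].status.fullClusterInfo.daemonStatus.osd.running}')"
#   osds_all_up_in "$running" && echo "all Ceph OSDs are up and in"
```

A new Ceph OSD can take some time to appear, so a failed check right after editing the KaaSCephCluster CR does not necessarily indicate a problem.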