Replace a failed Ceph OSD

After a physical disk replacement, you can use Rook to redeploy a failed Ceph OSD by restarting rook-operator, which triggers the reconfiguration of the managed cluster.

To redeploy a failed Ceph OSD:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export the kubeconfig of the managed cluster as described in Connect to a Mirantis Container Cloud cluster.
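
    For example, a minimal way to export it, assuming the kubeconfig file has been downloaded to the current directory as kubeconfig-managed-cluster (an illustrative file name):

    export KUBECONFIG=$(pwd)/kubeconfig-managed-cluster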

  3. Identify the failed Ceph OSD ID:

    ceph osd tree
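
    If the ceph CLI is not available on the local machine, you can run the same command through the ceph-tools pod instead, for example:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree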
    
  4. Remove the Ceph OSD deployment from the managed cluster:

    kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
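
    Substitute <ID> with the Ceph OSD ID obtained in the previous step. For example, assuming the failed OSD ID is 5 (an illustrative value):

    kubectl delete deployment -n rook-ceph rook-ceph-osd-5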
    
  5. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    
  6. Remove the failed Ceph OSD from the Ceph cluster:

    ceph osd purge osd.<ID>
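
    In recent Ceph releases, this command may require an explicit confirmation flag. For example, assuming the failed OSD ID is 5:

    ceph osd purge osd.5 --yes-i-really-mean-it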
    
  7. Delete the authorization key of the failed Ceph OSD from the Ceph cluster:

    ceph auth del osd.<ID>
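
    To verify that the OSD and its authorization key have been removed, you can, for example, re-check the OSD tree and the authorization list:

    ceph osd tree
    ceph auth ls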
    
  8. Exit from the ceph-tools pod.

  9. Replace the failed disk.

  10. Restart the Rook operator:

    kubectl delete pod $(kubectl -n rook-ceph get pod -l "app=rook-ceph-operator" \
    -o jsonpath='{.items[0].metadata.name}') -n rook-ceph
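
    To confirm that the operator pod has been recreated and to check the current state of the Ceph OSD pods, you can, for example, run:

    kubectl -n rook-ceph get pod -l app=rook-ceph-operator
    kubectl -n rook-ceph get pod -l app=rook-ceph-osd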
    
  11. Once rook-operator restarts, verify the Ceph OSD pod state and select from the following options:

    • If the Ceph OSD pod is in the Running state, the Ceph OSD has been successfully replaced. Skip the steps below.

    • If the Ceph OSD pod is still in the CrashLoopBackOff state, switch it to the Running state:

      1. Once the affected Ceph OSD pod enters the CrashLoopBackOff state again, determine which node hosts this pod:

        kubectl -n rook-ceph get pod <failedCephOsdPod> -o jsonpath='{.spec.nodeName}'
        

        Substitute <failedCephOsdPod> with the name of the failed Ceph OSD pod.

      2. Determine the block device folder used by this Ceph OSD pod and save it as <CephOsdBlockDevice>:

        kubectl -n rook-ceph get pod <failedCephOsdPod> -o jsonpath='{.spec.volumes[?(@.name == "activate-osd")].hostPath.path}'
        

        Substitute <failedCephOsdPod> with the name of the failed Ceph OSD pod.

      3. SSH to the affected node and obtain the Ceph OSD authorization keyring:

        sudo su
        cat <CephOsdBlockDevice>/keyring
        

        Obtain the key value and save it as <CephOsdKey>.
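
        The keyring file typically has the following format, where the key value shown is an illustrative placeholder:

        [osd.5]
        key = AQC9eXRfAAAAABAAxxxxxxxxxxxxxxxxxxxxxx==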

      4. Return to the cluster and enter the ceph-tools pod:

        kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
        
      5. Create a Ceph OSD keyring:

        vi key
        # inside editor:
        [osd.<ID>]
        key = <CephOsdKey>
        caps mgr = "allow profile osd"
        caps mon = "allow profile osd"
        caps osd = "allow *"
        # save and close the file
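
        Alternatively, you can create the keyring file non-interactively with a heredoc, for example:

        cat > key <<EOF
        [osd.<ID>]
        key = <CephOsdKey>
        caps mgr = "allow profile osd"
        caps mon = "allow profile osd"
        caps osd = "allow *"
        EOF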
        
      6. Import the created Ceph OSD authorization key:

        ceph auth import -i key
        
      7. Verify that the key has been successfully imported:

        ceph auth ls
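
        To check only the affected OSD entry, you can, for example, run:

        ceph auth get osd.<ID>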
        
      8. Exit from the ceph-tools pod.

      9. Delete the CrashLoopBackOff Ceph OSD pod and wait until it is Running:

        kubectl -n rook-ceph delete pod <failedCephOsdPod>
        kubectl -n rook-ceph get pod -l app=rook-ceph-osd -w
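
        Once the pod is Running, you can additionally verify from the ceph-tools pod that the replaced Ceph OSD is up and in, for example:

        ceph osd tree
        ceph -s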