# Ceph disaster recovery
Warning

This procedure is valid for MOSK clusters that use the deprecated
`KaaSCephCluster` custom resource (CR) instead of the `MiraCeph` CR, which is
available since MOSK 25.2 as the new Ceph configuration entry point. For the
equivalent procedure based on the `MiraCeph` CR, refer to the corresponding
section of this guide.
This section describes how to recover a failed or accidentally removed Ceph cluster in the following cases:
- If the Ceph Controller underlying a running Rook Ceph cluster has failed and you want to install a new Ceph Controller Helm release and recover the failed Ceph cluster onto it.
- To migrate the data of an existing Ceph cluster to a new deployment when downtime can be tolerated.
Consider the common state of a failed or removed Ceph cluster:
- The `rook-ceph` namespace does not contain pods, or the pods are in the `Terminating` state.
- The `rook-ceph` and/or `ceph-lcm-mirantis` namespaces are in the `Terminating` state.
- The `ceph-operator` Helm release is in the `FAILED` state:
  - Management cluster: the state of the `ceph-operator` Helm release in the management HelmBundle, such as `default/kaas-mgmt`, has switched from `DEPLOYED` to `FAILED`.
  - MOSK cluster: the state of the `osh-system/ceph-operator` HelmBundle, or a related namespace, has switched from `DEPLOYED` to `FAILED`.
- The Rook `CephCluster`, `CephBlockPool`, and `CephObjectStore` CRs in the `rook-ceph` namespace cannot be found or have the `deletionTimestamp` parameter in the `metadata` section.
Note
Prior to recovering the Ceph cluster, verify that your deployment meets the following prerequisites:
- The Ceph cluster `fsid` exists.
- The Ceph cluster Monitor keyrings exist.
- The Ceph cluster devices exist and include the data previously handled by Ceph OSDs.
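Before you proceed, you can spot-check these prerequisites. The following is a minimal sketch that assumes the default Rook data path `/var/lib/rook` used throughout this procedure; the exact location of the cluster config file may differ between Rook versions.

```bash
# On a node that hosted a Ceph Monitor or Ceph OSD before the failure:
grep fsid /var/lib/rook/rook-ceph/rook-ceph.config   # Cluster fsid (path may vary per Rook version)
ls /var/lib/rook/mon-*/keyring                       # Ceph Monitor keyrings
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT                   # Devices previously used by Ceph OSDs

# If the rook-ceph-mon secret still exists, the fsid can also be read from it:
kubectl -n rook-ceph get secret rook-ceph-mon -o jsonpath='{.data.fsid}' | base64 -d
```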
## Ceph cluster recovery workflow
- Create a backup of the remaining data and resources.
- Clean up the failed or removed `ceph-operator` Helm release.
- Deploy a new `ceph-operator` Helm release with the previously used `KaaSCephCluster` and one Ceph Monitor.
- Replace the `ceph-mon` data with the old cluster data.
- Replace `fsid` in `secrets/rook-ceph-mon` with the old one.
- Fix the Monitor map in the `ceph-mon` database.
- Fix the Ceph Monitor authentication key and disable authentication.
- Start the restored cluster and inspect the recovery.
- Fix the admin authentication key and enable authentication.
- Restart the cluster.
## Recover a failed or removed Ceph cluster
- Back up the remaining resources. Skip the commands for the resources that have already been removed:

  ```bash
  kubectl -n rook-ceph get cephcluster <clusterName> -o yaml > backup/cephcluster.yaml
  # Perform this for each cephblockpool
  kubectl -n rook-ceph get cephblockpool <cephBlockPool-i> -o yaml > backup/<cephBlockPool-i>.yaml
  # Perform this for each client
  kubectl -n rook-ceph get cephclient <cephclient-i> -o yaml > backup/<cephclient-i>.yaml
  kubectl -n rook-ceph get cephobjectstore <cephObjectStoreName> -o yaml > backup/<cephObjectStoreName>.yaml
  # Perform this for each secret
  kubectl -n rook-ceph get secret <secret-i> -o yaml > backup/<secret-i>.yaml
  # Perform this for each configMap
  kubectl -n rook-ceph get cm <cm-i> -o yaml > backup/<cm-i>.yaml
  ```
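  If many objects of the same kind remain, a small loop can save typing. This optional sketch only combines the commands above and assumes `bash` and the same `backup/` directory:

  ```bash
  mkdir -p backup
  for kind in cephblockpool cephclient secret cm; do
    for name in $(kubectl -n rook-ceph get "$kind" -o jsonpath='{.items[*].metadata.name}'); do
      kubectl -n rook-ceph get "$kind" "$name" -o yaml > "backup/${kind}-${name}.yaml"
    done
  done
  ```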
- SSH to each node where the Ceph Monitors or Ceph OSDs were placed before the failure and back up the valuable data:

  ```bash
  mv /var/lib/rook /var/lib/rook.backup
  mv /etc/ceph /etc/ceph.backup
  mv /etc/rook /etc/rook.backup
  ```

  Once done, close the SSH connection.
- Clean up the previous installation of `ceph-operator`. For details, see Rook documentation: Cleaning up a cluster.

  - Delete the `ceph-lcm-mirantis/ceph-controller` deployment:

    ```bash
    kubectl -n ceph-lcm-mirantis delete deployment ceph-controller
    ```
  - Delete all deployments, DaemonSets, and jobs from the `rook-ceph` namespace, if any:

    ```bash
    kubectl -n rook-ceph delete deployment --all
    kubectl -n rook-ceph delete daemonset --all
    kubectl -n rook-ceph delete job --all
    ```
  - Edit the `MiraCeph` and `MiraCephHealth` CRs of the `ceph-lcm-mirantis` namespace and remove the `finalizer` parameter from the `metadata` section:

    ```bash
    kubectl -n ceph-lcm-mirantis edit miraceph
    kubectl -n ceph-lcm-mirantis edit miracephhealth
    ```

    Note

    Before MOSK 25.1, use `MiraCephLog` instead of `MiraCephHealth` as the object name and in the command above.
  - Edit the `CephCluster`, `CephBlockPool`, `CephClient`, `CephObjectStore`, and `CephObjectStoreUser` CRs of the `rook-ceph` namespace and remove the `finalizer` parameter from the `metadata` section:

    ```bash
    kubectl -n rook-ceph edit cephclusters
    kubectl -n rook-ceph edit cephblockpools
    kubectl -n rook-ceph edit cephclients
    kubectl -n rook-ceph edit cephobjectstores
    kubectl -n rook-ceph edit cephobjectusers
    ```
  - Once you clean up every resource related to the Ceph release, open the `Cluster` CR for editing:

    ```bash
    kubectl -n <projectName> edit cluster <clusterName>
    ```

    Substitute `<projectName>` with `default` for the management cluster or with the related project name for the MOSK cluster.
  - Remove the `ceph-controller` Helm release item from the `spec.providerSpec.value.helmReleases` array and save the `Cluster` CR:

    ```yaml
    - name: ceph-controller
      values: {}
    ```
  - Verify that `ceph-controller` has disappeared from the corresponding HelmBundle:

    ```bash
    kubectl -n <projectName> get helmbundle -o yaml
    ```
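    To narrow down the output, you can filter it with standard `grep`, for example; an empty result means the release entry is gone:

    ```bash
    kubectl -n <projectName> get helmbundle -o yaml | grep -A2 'name: ceph-controller'
    ```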
 
- Open the `KaaSCephCluster` CR of the related management or MOSK cluster for editing:

  ```bash
  kubectl -n <projectName> edit kaascephcluster
  ```

  Substitute `<projectName>` with `default` for the management cluster or with the related project name for the MOSK cluster.
- Edit the node roles. The entire `nodes` spec must contain only one `mon` role. Save `KaaSCephCluster` after editing.
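  For example, to review which nodes currently have roles assigned before trimming them down to a single `mon`, you can dump the `nodes` section. This is a sketch that assumes the roles live under `spec.cephClusterSpec.nodes` and that `jq` is available; adjust the path if your CR layout differs:

  ```bash
  kubectl -n <projectName> get kaascephcluster -o json \
    | jq '.items[].spec.cephClusterSpec.nodes'   # Assumed path; inspect the full CR if it differs
  ```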
- Open the `Cluster` CR for editing:

  ```bash
  kubectl -n <projectName> edit cluster <clusterName>
  ```

  Substitute `<projectName>` with `default` for the management cluster or with the related project name for the MOSK cluster.
- Add `ceph-controller` to `spec.providerSpec.value.helmReleases` to restore the `ceph-controller` Helm release. Save `Cluster` after editing:

  ```yaml
  - name: ceph-controller
    values: {}
  ```
- Verify that the `ceph-controller` Helm release is deployed:

  - Inspect the Rook Operator logs and wait until the orchestration has settled:

    ```bash
    kubectl -n rook-ceph logs -l app=rook-ceph-operator
    ```
  - Verify that the `rook-ceph-mon-a`, `rook-ceph-mgr-a`, and all the auxiliary pods in the `rook-ceph` namespace are up and running, and that no `rook-ceph-osd-ID-xxxxxx` pods are running:

    ```bash
    kubectl -n rook-ceph get pod
    ```
  - Verify the Ceph state. The output must indicate that one `mon` and one `mgr` are running, all Ceph OSDs are down, and all PGs are in the `Unknown` state:

    ```bash
    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    ```

    Note

    Rook should not start any Ceph OSD daemon because all devices belong to the old cluster that has a different `fsid`. To verify the Ceph OSD daemons, inspect the `osd-prepare` pod logs:

    ```bash
    kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare
    ```
- Connect to the terminal of the `rook-ceph-mon-a` pod:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}') bash
  ```
- Output the `keyring` file and save it for further usage:

  ```bash
  cat /etc/ceph/keyring-store/keyring
  exit
  ```
- Obtain and save the `nodeName` of `mon-a` for further usage:

  ```bash
  kubectl -n rook-ceph get pod $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}') -o jsonpath='{.spec.nodeName}'
  ```
- Obtain and save the `cephImage` used in the Ceph cluster for further usage:

  ```bash
  kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.cephImage}'
  ```
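  Optionally, the values from the previous three steps can be captured in one pass. This sketch only combines the commands shown above and assumes `bash`:

  ```bash
  # Name of the single remaining Ceph Monitor pod
  MON_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}')
  # Save the keyring to a file instead of copying it from the terminal
  kubectl -n rook-ceph exec "$MON_POD" -- cat /etc/ceph/keyring-store/keyring > backup/mon-keyring
  # Node that hosts mon-a and the Ceph image in use
  MON_NODE=$(kubectl -n rook-ceph get pod "$MON_POD" -o jsonpath='{.spec.nodeName}')
  CEPH_IMAGE=$(kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.cephImage}')
  echo "mon node: ${MON_NODE}  ceph image: ${CEPH_IMAGE}"
  ```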
- Stop Rook Operator by scaling the deployment replicas to `0`:

  ```bash
  kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
  ```
- Remove the Rook deployments generated with Rook Operator:

  ```bash
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-mon
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-mgr
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-crashcollector
  ```
- Using the saved `nodeName`, SSH to the host where `rook-ceph-mon-a` in the new Kubernetes cluster is placed and perform the following steps:

  - Remove `/var/lib/rook/mon-a` or copy it to another folder:

    ```bash
    mv /var/lib/rook/mon-a /var/lib/rook/mon-a.new
    ```
  - Pick a healthy `rook-ceph-mon-<ID>` directory (`/var/lib/rook.backup/mon-<ID>`) from the previous backup and copy it to `/var/lib/rook/mon-a`:

    ```bash
    cp -rp /var/lib/rook.backup/mon-<ID> /var/lib/rook/mon-a
    ```

    Substitute `<ID>` with any healthy `mon` node ID of the old cluster.
  - Replace `/var/lib/rook/mon-a/keyring` with the previously saved keyring, preserving only the `[mon.]` section. Remove the `[client.admin]` section.
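    After the edit, the file should contain only the Monitor section. The following sketch writes such a file; the key value is a placeholder to replace with the `[mon.]` key from the keyring saved earlier:

    ```bash
    cat > /var/lib/rook/mon-a/keyring <<'EOF'
    [mon.]
            key = <mon-key-saved-earlier>
            caps mon = "allow *"
    EOF
    ```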
  - Run a Docker container from the previously saved `cephImage` image:

    ```bash
    docker run -it --rm -v /var/lib/rook:/var/lib/rook <cephImage> bash
    ```
  - Inside the container, create `/etc/ceph/ceph.conf` for a stable operation of `ceph-mon`:

    ```bash
    touch /etc/ceph/ceph.conf
    ```
  - Change the directory to `/var/lib/rook` and edit `monmap` by replacing the existing `mon` hosts with the new `mon-a` endpoints:

    ```bash
    cd /var/lib/rook
    rm /var/lib/rook/mon-a/data/store.db/LOCK                   # Make sure the quorum lock file does not exist
    ceph-mon --extract-monmap monmap --mon-data ./mon-a/data    # Extract monmap from the old ceph-mon db and save it as monmap
    monmaptool --print monmap                                   # Print the monmap content, which reflects the old cluster ceph-mon configuration
    monmaptool --rm a monmap                                    # Delete `a` from monmap
    monmaptool --rm b monmap                                    # Repeat and delete `b` from monmap
    monmaptool --rm c monmap                                    # Repeat this pattern until all the old ceph-mons are removed and monmap is empty
    monmaptool --addv a [v2:<nodeIP>:3300,v1:<nodeIP>:6789] monmap   # Add the new mon-a endpoints
    ceph-mon --inject-monmap monmap --mon-data ./mon-a/data     # Replace monmap in the ceph-mon db with the modified version
    rm monmap
    exit
    ```

    Substitute `<nodeIP>` with the IP address of the current `<nodeName>` node.
  - Close the SSH connection.
- Change `fsid` to the original one to run Rook as an old cluster:

  ```bash
  kubectl -n rook-ceph edit secret/rook-ceph-mon
  ```

  Note

  The `fsid` is `base64` encoded and must not contain a trailing carriage return. For example:

  ```bash
  echo -n a811f99a-d865-46b7-8f2c-f94c064e4356 | base64  # Replace with the fsid from the old cluster
  ```
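  Alternatively, the secret can be patched non-interactively. This is a sketch; substitute `<old-fsid>` with the `fsid` of the old cluster before running it:

  ```bash
  # --type merge updates only the fsid key; base64 -w0 avoids a trailing newline
  kubectl -n rook-ceph patch secret rook-ceph-mon --type merge \
    -p "{\"data\":{\"fsid\":\"$(echo -n <old-fsid> | base64 -w0)\"}}"
  ```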
- Scale the `ceph-lcm-mirantis/ceph-controller` deployment replicas to `0`:

  ```bash
  kubectl -n ceph-lcm-mirantis scale deployment ceph-controller --replicas 0
  ```
- Disable authentication:

  - Open the `cm/rook-config-override` ConfigMap for editing:

    ```bash
    kubectl -n rook-ceph edit cm/rook-config-override
    ```

  - Add the following content:

    ```yaml
    data:
      config: |
        [global]
        ...
        auth cluster required = none
        auth service required = none
        auth client required = none
        auth supported = none
    ```
- Start Rook Operator by scaling its deployment replicas to `1`:

  ```bash
  kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
  ```
- Inspect the Rook Operator logs and wait until the orchestration has settled:

  ```bash
  kubectl -n rook-ceph logs -l app=rook-ceph-operator
  ```
- Verify that the `rook-ceph-mon-a`, `rook-ceph-mgr-a`, and all the auxiliary pods in the `rook-ceph` namespace are up and running, and that the number of running `rook-ceph-osd-ID-xxxxxx` pods is greater than zero:

  ```bash
  kubectl -n rook-ceph get pod
  ```
- Verify the Ceph state. The output must indicate that one `mon`, one `mgr`, and all Ceph OSDs are up and running, and that all PGs are either in the `Active` or `Degraded` state:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
  ```
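  For a more detailed view of the OSD and PG status, the standard Ceph commands can be run from the same toolbox pod, for example:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph pg stat
  ```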
- Enter the `ceph-tools` pod and import the authentication key:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') bash
  vi key
  # Paste the keyring content saved before, preserving only the [client.admin] section
  ceph auth import -i key
  rm key
  exit
  ```
- Stop Rook Operator by scaling the deployment to `0` replicas:

  ```bash
  kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
  ```
- Re-enable authentication:

  - Open the `cm/rook-config-override` ConfigMap for editing:

    ```bash
    kubectl -n rook-ceph edit cm/rook-config-override
    ```

  - Remove the following content:

    ```yaml
    data:
      config: |
        [global]
        ...
        auth cluster required = none
        auth service required = none
        auth client required = none
        auth supported = none
    ```
- Remove all Rook deployments generated with Rook Operator:

  ```bash
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-mon
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-mgr
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd
  kubectl -n rook-ceph delete deploy -l app=rook-ceph-crashcollector
  ```
- Start Ceph Controller by scaling its deployment replicas to `1`:

  ```bash
  kubectl -n ceph-lcm-mirantis scale deployment ceph-controller --replicas 1
  ```
- Start Rook Operator by scaling its deployment replicas to `1`:

  ```bash
  kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
  ```
- Inspect the Rook Operator logs and wait until the orchestration has settled:

  ```bash
  kubectl -n rook-ceph logs -l app=rook-ceph-operator
  ```
- Verify that the `rook-ceph-mon-a`, `rook-ceph-mgr-a`, and all the auxiliary pods in the `rook-ceph` namespace are up and running, and that the number of running `rook-ceph-osd-ID-xxxxxx` pods is greater than zero:

  ```bash
  kubectl -n rook-ceph get pod
  ```
- Verify the Ceph state. The output must indicate that one `mon`, one `mgr`, and all Ceph OSDs are up and running, and that the overall stored data size equals the old cluster data size:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
  ```
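  To compare the stored data size with the old cluster in more detail, `ceph df` from the same toolbox pod shows the per-pool and total usage, for example:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph df
  ```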
- Edit the `MiraCeph` CR and add two more `mon` and `mgr` roles to the corresponding nodes:

  ```bash
  kubectl -n ceph-lcm-mirantis edit miraceph
  ```
- Inspect the Rook namespace and wait until all Ceph Monitors are in the `Running` state:

  ```bash
  kubectl -n rook-ceph get pod -l app=rook-ceph-mon
  ```
- Verify the Ceph state. The output must indicate that three `mon` (three in quorum), one `mgr`, and all Ceph OSDs are up and running, and that the overall stored data size equals the old cluster data size:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
  ```
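  To additionally confirm that all three Ceph Monitors have formed a quorum, you can run, for example:

  ```bash
  kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph mon stat
  ```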
Once done, the data from the failed or removed Ceph cluster is restored and ready to use.