Ceph Monitors recovery
Warning
This procedure is valid for MOSK clusters that use the deprecated
KaaSCephCluster custom resource (CR) instead of the MiraCeph CR that is
available since MOSK 25.2 as a new Ceph configuration entrypoint. For the
equivalent procedure based on the MiraCeph CR, refer to the corresponding section of this documentation.
This section describes how to recover failed Ceph Monitors of an existing Ceph cluster in the following state:
- The Ceph cluster contains failed Ceph Monitors that cannot start and hang in the Error or CrashLoopBackOff state.
- The logs of the failed Ceph Monitor pods contain the following lines:

  mon.g does not exist in monmap, will attempt to join an existing cluster
  ...
  mon.g@-1(???) e11 not in monmap and have been in a quorum before; must have been removed
  mon.g@-1(???) e11 commit suicide!

- The Ceph cluster contains at least one Running Ceph Monitor, and the ceph -s command outputs one healthy mon and one healthy mgr instance.
Unless stated otherwise, perform the following steps for all failed Ceph Monitors at once.
To recover failed Ceph Monitors:
Obtain and export the kubeconfig of the affected cluster.
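For example, assuming the kubeconfig file has been saved locally (the file name below is illustrative):
export KUBECONFIG=~/kubeconfig-<clusterName>.yaml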
Scale the rook-ceph/rook-ceph-operator deployment down to 0 replicas:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
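Optionally, confirm that the operator deployment has been scaled down to zero replicas before proceeding:
kubectl -n rook-ceph get deploy rook-ceph-operator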
Delete all failed Ceph Monitor deployments:
Identify the Ceph Monitor pods in the Error or CrashLoopBackOff state:
kubectl -n rook-ceph get pod -l 'app in (rook-ceph-mon,rook-ceph-mon-canary)'
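To narrow the output down to the failed pods only, you can optionally filter it, for example:
kubectl -n rook-ceph get pod -l 'app in (rook-ceph-mon,rook-ceph-mon-canary)' | grep -E 'Error|CrashLoopBackOff'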
Verify that the affected pods contain the failure logs described above:
kubectl -n rook-ceph logs <failedMonPodName>
Substitute <failedMonPodName> with the Ceph Monitor pod name. For example, rook-ceph-mon-g-845d44b9c6-fjc5d.
Save the identifying letters of the failed Ceph Monitors for further usage. For example, f, e, and so on.
Delete all corresponding deployments of these pods:
Identify the affected Ceph Monitor pod deployments:
kubectl -n rook-ceph get deploy -l 'app in (rook-ceph-mon,rook-ceph-mon-canary)'
Delete the affected Ceph Monitor pod deployments. For example, if the Ceph cluster has the rook-ceph-mon-c-845d44b9c6-fjc5d pod in the CrashLoopBackOff state, remove the corresponding rook-ceph-mon-c deployment:
kubectl -n rook-ceph delete deploy rook-ceph-mon-c
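If several Ceph Monitors failed, their deployments can be deleted in a single command. For example, for the failed Ceph Monitors e and f:
kubectl -n rook-ceph delete deploy rook-ceph-mon-e rook-ceph-mon-f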
Canary mon deployments have the suffix -canary.
Remove all corresponding entries of Ceph Monitors from the MON map:
Enter the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') bash
Inspect the current MON map and save the IP addresses of the failed Ceph Monitors for further usage:
ceph mon dump
Remove all entries of failed Ceph Monitors using the previously saved letters:
ceph mon rm <monLetter>
Substitute <monLetter> with the corresponding letter of a failed Ceph Monitor.
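For example, if e and f are the failed Ceph Monitors:
ceph mon rm e
ceph mon rm f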
Exit the ceph-tools pod.
Remove all failed Ceph Monitor entries from the Rook mon endpoints ConfigMap:
Open the rook-ceph/rook-ceph-mon-endpoints ConfigMap for editing:
kubectl -n rook-ceph edit cm rook-ceph-mon-endpoints
Remove all entries of failed Ceph Monitors from the ConfigMap data and update the maxMonId value with the current number of Running Ceph Monitors. For example, rook-ceph-mon-endpoints has the following data:

data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["172.0.0.222:6789","172.0.0.223:6789","172.0.0.224:6789","172.16.52.217:6789","172.16.52.216:6789"]}]'
  data: a=172.0.0.222:6789,b=172.0.0.223:6789,c=172.0.0.224:6789,f=172.0.0.217:6789,e=172.0.0.216:6789
  mapping: '{"node":{
    "a":{"Name":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Hostname":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Address":"172.16.52.222"},
    "b":{"Name":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Hostname":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Address":"172.16.52.223"},
    "c":{"Name":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Hostname":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Address":"172.16.52.224"},
    "e":{"Name":"kaas-node-ba3bfa17-77d2-467c-91eb-6291fb219a80","Hostname":"kaas-node-ba3bfa17-77d2-467c-91eb-6291fb219a80","Address":"172.16.52.216"},
    "f":{"Name":"kaas-node-6f669490-f0c7-4d19-bf73-e51fbd6c7672","Hostname":"kaas-node-6f669490-f0c7-4d19-bf73-e51fbd6c7672","Address":"172.16.52.217"}}
    }'
  maxMonId: "5"
If e and f are the letters of the failed Ceph Monitors, the resulting ConfigMap data must be as follows:

data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["172.0.0.222:6789","172.0.0.223:6789","172.0.0.224:6789"]}]'
  data: a=172.0.0.222:6789,b=172.0.0.223:6789,c=172.0.0.224:6789
  mapping: '{"node":{
    "a":{"Name":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Hostname":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Address":"172.16.52.222"},
    "b":{"Name":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Hostname":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Address":"172.16.52.223"},
    "c":{"Name":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Hostname":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Address":"172.16.52.224"}}
    }'
  maxMonId: "3"
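Optionally, verify the resulting ConfigMap after saving your changes:
kubectl -n rook-ceph get cm rook-ceph-mon-endpoints -o yaml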
Back up the data of the failed Ceph Monitors one by one:
SSH to the node of a failed Ceph Monitor using the previously saved IP address.
Move the Ceph Monitor data directory to another place:
mv /var/lib/rook/mon-<letter> /var/lib/rook/mon-<letter>.backup
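Substitute <letter> with the letter of the failed Ceph Monitor. For example, for the Ceph Monitor e:
mv /var/lib/rook/mon-e /var/lib/rook/mon-e.backup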
Close the SSH connection.
Scale the rook-ceph/rook-ceph-operator deployment up to 1 replica:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
Wait until all Ceph Monitors are in the Running state:
kubectl -n rook-ceph get pod -l app=rook-ceph-mon -w
Restore the data from the backup for each recovered Ceph Monitor one by one:
Enter a recovered Ceph Monitor pod:
kubectl -n rook-ceph exec -it <monPodName> bash
Substitute <monPodName> with the recovered Ceph Monitor pod name. For example, rook-ceph-mon-g-845d44b9c6-fjc5d.
Recover the mon data backup for the current Ceph Monitor:
ceph-monstore-tool /var/lib/rook/mon-<letter>.backup/data store-copy /var/lib/rook/mon-<letter>/data/
Substitute <letter> with the current Ceph Monitor pod letter, for example, e.
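In that case, the command becomes:
ceph-monstore-tool /var/lib/rook/mon-e.backup/data store-copy /var/lib/rook/mon-e/data/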
Verify the Ceph state. The output must indicate the desired number of Ceph Monitors and all of them must be in quorum.
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s